* [PATCH 00/16] tmpfs: HUGEPAGE and MEM_LOCK fcntls and memfds
@ 2021-07-30  7:22 ` Hugh Dickins
  0 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

A series adding HUGEPAGE and MEM_LOCK tmpfs fcntls and memfd_create flags,
preceded by fixes (not essential for stable) and cleanups in related areas.

Against 5.14-rc3: currently no conflict with linux-next or mmotm.

01/16 huge tmpfs: fix fallocate(vanilla) advance over huge pages
02/16 huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE
03/16 huge tmpfs: remove shrinklist addition from shmem_setattr()
04/16 huge tmpfs: revert shmem's use of transhuge_vma_enabled()
05/16 huge tmpfs: move shmem_huge_enabled() upwards
06/16 huge tmpfs: shmem_is_huge(vma, inode, index)
07/16 memfd: memfd_create(name, MFD_HUGEPAGE) for shmem huge pages
08/16 huge tmpfs: fcntl(fd, F_HUGEPAGE) and fcntl(fd, F_NOHUGEPAGE)
09/16 huge tmpfs: decide stat.st_blksize by shmem_is_huge()
10/16 tmpfs: fcntl(fd, F_MEM_LOCK) to memlock a tmpfs file
11/16 tmpfs: fcntl(fd, F_MEM_LOCKED) to test if memlocked
12/16 tmpfs: refuse memlock when fallocated beyond i_size
13/16 mm: bool user_shm_lock(loff_t size, struct ucounts *)
14/16 mm: user_shm_lock(,,getuc) and user_shm_unlock(,,putuc)
15/16 tmpfs: permit changing size of memlocked file
16/16 memfd: memfd_create(name, MFD_MEM_LOCK) for memlocked shmem
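
For orientation, a rough userspace sketch of how the new interfaces are
meant to be used (illustrative only, not part of any patch: MFD_HUGEPAGE's
value is taken from patch 07, and F_MEM_LOCK comes from patch 10, so that
part is guarded and drops out when built against headers without this
series):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#ifndef MFD_HUGEPAGE
#define MFD_HUGEPAGE 0x0008U		/* patch 07: prefer huge tmpfs pages */
#endif

int main(void)
{
	/* tmpfs-backed memfd, hinting that THPs should be used for it */
	int fd = memfd_create("example", MFD_CLOEXEC | MFD_HUGEPAGE);

	if (fd < 0) {
		perror("memfd_create");	/* EINVAL on a kernel without this series */
		return 1;
	}
	if (ftruncate(fd, 4 << 20))	/* 4MB: eligible for PMD-sized pages */
		perror("ftruncate");

#ifdef F_MEM_LOCK			/* patch 10: memlock the file's pages */
	if (fcntl(fd, F_MEM_LOCK) < 0)
		perror("fcntl(F_MEM_LOCK)");
#endif
	close(fd);
	return 0;
}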

 fs/fcntl.c                 |    8 
 fs/hugetlbfs/inode.c       |    4 
 include/linux/mm.h         |    4 
 include/linux/shmem_fs.h   |   31 ++-
 include/uapi/linux/fcntl.h |   17 +
 include/uapi/linux/memfd.h |    4 
 ipc/shm.c                  |    4 
 mm/huge_memory.c           |    6 
 mm/khugepaged.c            |    2 
 mm/memfd.c                 |   27 ++
 mm/mlock.c                 |   19 -
 mm/shmem.c                 |  397 ++++++++++++++++++++++++++-------------
 12 files changed, 370 insertions(+), 153 deletions(-)

Hugh

* [PATCH 01/16] huge tmpfs: fix fallocate(vanilla) advance over huge pages
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  7:25   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

shmem_fallocate() goes to a lot of trouble to leave its newly allocated
pages !Uptodate, partly to identify and undo them on failure, partly to
leave the overhead of clearing them until later.  But the huge page case
did not skip to the end of the extent: it walked through the tail pages
one by one, and appeared to work just fine; yet in doing so it cleared
and Uptodated the huge page, so there was no way to undo it on failure.

Now advance immediately to the end of the huge extent, with a comment on
why this is more than just an optimization.  But although this speeds up
huge tmpfs fallocation, it does leave the clearing until first use, and
some users may have come to appreciate slow fallocate but fast first use:
if they complain, then we can consider adding a pass to clear at the end.
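
To make the "Beware 32-bit wraparound" comment below concrete (a
standalone sketch, not part of the patch, assuming HPAGE_PMD_NR of 512 and
a 32-bit pgoff_t): rounding up past the last possible huge extent wraps
the index to 0, and the decrement turns it into the maximum value, so any
"index < end" loop still terminates.

#include <stdio.h>

#define HPAGE_PMD_NR	512
#define round_up(x, y)	((((x) - 1) | ((y) - 1)) + 1)

int main(void)
{
	unsigned int index = 0xfffffe00;	/* start of the last huge extent */

	index++;
	index = round_up(index, HPAGE_PMD_NR);	/* wraps to 0 in 32 bits */
	if (!index)
		index--;	/* -> 0xffffffff, ending any "index < end" loop */
	printf("index = %#x\n", index);		/* prints 0xffffffff */
	return 0;
}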

Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 70d9ce294bb4..0cd5c9156457 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2736,7 +2736,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	inode->i_private = &shmem_falloc;
 	spin_unlock(&inode->i_lock);
 
-	for (index = start; index < end; index++) {
+	for (index = start; index < end; ) {
 		struct page *page;
 
 		/*
@@ -2759,13 +2759,26 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			goto undone;
 		}
 
+		index++;
+		/*
+		 * Here is a more important optimization than it appears:
+		 * a second SGP_FALLOC on the same huge page will clear it,
+		 * making it PageUptodate and un-undoable if we fail later.
+		 */
+		if (PageTransCompound(page)) {
+			index = round_up(index, HPAGE_PMD_NR);
+			/* Beware 32-bit wraparound */
+			if (!index)
+				index--;
+		}
+
 		/*
 		 * Inform shmem_writepage() how far we have reached.
 		 * No need for lock or barrier: we have the page lock.
 		 */
-		shmem_falloc.next++;
 		if (!PageUptodate(page))
-			shmem_falloc.nr_falloced++;
+			shmem_falloc.nr_falloced += index - shmem_falloc.next;
+		shmem_falloc.next = index;
 
 		/*
 		 * If !PageUptodate, leave it that way so that freeable pages
-- 
2.26.2


* [PATCH 02/16] huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  7:28   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

A successful shmem_fallocate() guarantees that the extent has been
reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used.
But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to
split huge pages and free their excess beyond i_size; and by other uses
of split_huge_page() near i_size.

It's sad to add a shmem inode field just for this, but I did not find a
better way to keep the guarantee.  A flag to say KEEP_SIZE has been used
would be cheaper, but I'm averse to unclearable flags.  The fallocend
field is not perfect either (many disjoint ranges might be fallocated),
but good enough; and gains another use later on.
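
For context, the kind of userspace sequence whose reservation this
protects (a sketch only, not part of the patch: the tmpfs path and sizes
are illustrative, and whether split_huge_page() is later attempted on the
huge page depends on memory pressure or hole-punching):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	int fd = open("/dev/shm/keepsize", O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* i_size stays 0, but 4MB beyond EOF is reserved and must stay so */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 4 << 20))
		perror("fallocate");

	struct stat st;
	fstat(fd, &st);
	printf("st_size=%lld st_blocks=%lld\n",
	       (long long)st.st_size, (long long)st.st_blocks);
	close(fd);
	return 0;
}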

Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/shmem_fs.h | 13 +++++++++++++
 mm/huge_memory.c         |  6 ++++--
 mm/shmem.c               | 15 ++++++++++++++-
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 8e775ce517bb..9b7f7ac52351 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -18,6 +18,7 @@ struct shmem_inode_info {
 	unsigned long		flags;
 	unsigned long		alloced;	/* data pages alloced to file */
 	unsigned long		swapped;	/* subtotal assigned to swap */
+	pgoff_t			fallocend;	/* highest fallocate endindex */
 	struct list_head        shrinklist;     /* shrinkable hpage inodes */
 	struct list_head	swaplist;	/* chain of maybes on swap */
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
@@ -119,6 +120,18 @@ static inline bool shmem_file(struct file *file)
 	return shmem_mapping(file->f_mapping);
 }
 
+/*
+ * If fallocate(FALLOC_FL_KEEP_SIZE) has been used, there may be pages
+ * beyond i_size's notion of EOF, which fallocate has committed to reserving:
+ * which split_huge_page() must therefore not delete.  This use of a single
+ * "fallocend" per inode errs on the side of not deleting a reservation when
+ * in doubt: there are plenty of cases when it preserves unreserved pages.
+ */
+static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
+{
+	return max(eof, SHMEM_I(inode)->fallocend);
+}
+
 extern bool shmem_charge(struct inode *inode, long pages);
 extern void shmem_uncharge(struct inode *inode, long pages);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index afff3ac87067..890fb73ac89b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2454,11 +2454,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
-		/* Some pages can be beyond i_size: drop them from page cache */
+		/* Some pages can be beyond EOF: drop them from page cache */
 		if (head[i].index >= end) {
 			ClearPageDirty(head + i);
 			__delete_from_page_cache(head + i, NULL);
-			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
+			if (shmem_mapping(head->mapping))
 				shmem_uncharge(head->mapping->host, 1);
 			put_page(head + i);
 		} else if (!PageAnon(page)) {
@@ -2686,6 +2686,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		 * head page lock is good enough to serialize the trimming.
 		 */
 		end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
+		if (shmem_mapping(mapping))
+			end = shmem_fallocend(mapping->host, end);
 	}
 
 	/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 0cd5c9156457..24c9da6b41c2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -905,6 +905,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	if (lend == -1)
 		end = -1;	/* unsigned, so actually very big */
 
+	if (info->fallocend > start && info->fallocend <= end && !unfalloc)
+		info->fallocend = start;
+
 	pagevec_init(&pvec);
 	index = start;
 	while (index < end && find_lock_entries(mapping, index, end - 1,
@@ -2667,7 +2670,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_falloc shmem_falloc;
-	pgoff_t start, index, end;
+	pgoff_t start, index, end, undo_fallocend;
 	int error;
 
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
@@ -2736,6 +2739,15 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	inode->i_private = &shmem_falloc;
 	spin_unlock(&inode->i_lock);
 
+	/*
+	 * info->fallocend is only relevant when huge pages might be
+	 * involved: to prevent split_huge_page() freeing fallocated
+	 * pages when FALLOC_FL_KEEP_SIZE committed beyond i_size.
+	 */
+	undo_fallocend = info->fallocend;
+	if (info->fallocend < end)
+		info->fallocend = end;
+
 	for (index = start; index < end; ) {
 		struct page *page;
 
@@ -2750,6 +2762,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		else
 			error = shmem_getpage(inode, index, &page, SGP_FALLOC);
 		if (error) {
+			info->fallocend = undo_fallocend;
 			/* Remove the !PageUptodate pages we added */
 			if (index > start) {
 				shmem_undo_range(inode,
-- 
2.26.2


* [PATCH 03/16] huge tmpfs: remove shrinklist addition from shmem_setattr()
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  7:30   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

There's a block of code in shmem_setattr() to add the inode to
shmem_unused_huge_shrink()'s shrinklist when lowering i_size: it dates
from before 5.7 changed truncation to do split_huge_page() for itself,
and should have been removed at that time.

I am overstating that: split_huge_page() can fail (notably if there's
an extra reference to the page at that time), so there might be value in
retrying.  But there were already retries as truncation worked through
the tails, and this addition risks repeating unsuccessful retries
indefinitely: I'd rather remove it now, and work on reducing the
chance of split_huge_page() failures separately, if we need to.

Fixes: 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page() when punching hole")
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 24c9da6b41c2..ce3ccaac54d6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1061,7 +1061,6 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 {
 	struct inode *inode = d_inode(dentry);
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	int error;
 
 	error = setattr_prepare(&init_user_ns, dentry, attr);
@@ -1097,24 +1096,6 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 			if (oldsize > holebegin)
 				unmap_mapping_range(inode->i_mapping,
 							holebegin, 0, 1);
-
-			/*
-			 * Part of the huge page can be beyond i_size: subject
-			 * to shrink under memory pressure.
-			 */
-			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-				spin_lock(&sbinfo->shrinklist_lock);
-				/*
-				 * _careful to defend against unlocked access to
-				 * ->shrink_list in shmem_unused_huge_shrink()
-				 */
-				if (list_empty_careful(&info->shrinklist)) {
-					list_add_tail(&info->shrinklist,
-							&sbinfo->shrinklist);
-					sbinfo->shrinklist_len++;
-				}
-				spin_unlock(&sbinfo->shrinklist_lock);
-			}
 		}
 	}
 
-- 
2.26.2


* [PATCH 04/16] huge tmpfs: revert shmem's use of transhuge_vma_enabled()
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  7:36   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

5.14 commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP
checking in transparent_hugepage_enabled()") added transhuge_vma_enabled()
as a wrapper for two very different checks: shmem_huge_enabled() prefers
to show those two checks explicitly, as before.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ce3ccaac54d6..c6fa6f4f2db8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4003,7 +4003,8 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
 	loff_t i_size;
 	pgoff_t off;
 
-	if (!transhuge_vma_enabled(vma, vma->vm_flags))
+	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
 		return false;
 	if (shmem_huge == SHMEM_HUGE_FORCE)
 		return true;
-- 
2.26.2


* [PATCH 05/16] huge tmpfs: move shmem_huge_enabled() upwards
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  7:39   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

shmem_huge_enabled() is about to be enhanced into shmem_is_huge(),
so that it can be used more widely throughout: before making functional
changes, shift it to its final position (to avoid forward declaration).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 72 ++++++++++++++++++++++++++----------------------------
 1 file changed, 35 insertions(+), 37 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index c6fa6f4f2db8..740d48ef1eb5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -476,6 +476,41 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 
 static int shmem_huge __read_mostly;
 
+bool shmem_huge_enabled(struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	loff_t i_size;
+	pgoff_t off;
+
+	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+		return false;
+	if (shmem_huge == SHMEM_HUGE_FORCE)
+		return true;
+	if (shmem_huge == SHMEM_HUGE_DENY)
+		return false;
+	switch (sbinfo->huge) {
+	case SHMEM_HUGE_NEVER:
+		return false;
+	case SHMEM_HUGE_ALWAYS:
+		return true;
+	case SHMEM_HUGE_WITHIN_SIZE:
+		off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
+		i_size = round_up(i_size_read(inode), PAGE_SIZE);
+		if (i_size >= HPAGE_PMD_SIZE &&
+				i_size >> PAGE_SHIFT >= off)
+			return true;
+		fallthrough;
+	case SHMEM_HUGE_ADVISE:
+		/* TODO: implement fadvise() hints */
+		return (vma->vm_flags & VM_HUGEPAGE);
+	default:
+		VM_BUG_ON(1);
+		return false;
+	}
+}
+
 #if defined(CONFIG_SYSFS)
 static int shmem_parse_huge(const char *str)
 {
@@ -3995,43 +4030,6 @@ struct kobj_attribute shmem_enabled_attr =
 	__ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-bool shmem_huge_enabled(struct vm_area_struct *vma)
-{
-	struct inode *inode = file_inode(vma->vm_file);
-	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-	loff_t i_size;
-	pgoff_t off;
-
-	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-		return false;
-	if (shmem_huge == SHMEM_HUGE_FORCE)
-		return true;
-	if (shmem_huge == SHMEM_HUGE_DENY)
-		return false;
-	switch (sbinfo->huge) {
-		case SHMEM_HUGE_NEVER:
-			return false;
-		case SHMEM_HUGE_ALWAYS:
-			return true;
-		case SHMEM_HUGE_WITHIN_SIZE:
-			off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
-			i_size = round_up(i_size_read(inode), PAGE_SIZE);
-			if (i_size >= HPAGE_PMD_SIZE &&
-					i_size >> PAGE_SHIFT >= off)
-				return true;
-			fallthrough;
-		case SHMEM_HUGE_ADVISE:
-			/* TODO: implement fadvise() hints */
-			return (vma->vm_flags & VM_HUGEPAGE);
-		default:
-			VM_BUG_ON(1);
-			return false;
-	}
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 #else /* !CONFIG_SHMEM */
 
 /*
-- 
2.26.2


* [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  7:42   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
that a consistent set of checks can be applied, even when the inode is
accessed through read/write syscalls (with NULL vma) instead of mmaps
(the index argument is seldom of interest, but required by mount option
"huge=within_size").  Clean up and rearrange the checks a little.

This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
still true that khugepaged's collapse_file() at that point wants a small
page, the race that might allocate it a huge page is too unlikely to be
worth optimizing against (we are there *because* there was at least one
small page in the way), and is handled by a later PageTransCompound check.

Replace a couple of 0s by explicit SHMEM_HUGE_NEVERs; and replace the
obscure !shmem_mapping() symlink check by explicit S_ISLNK() - nothing
else needs that symlink check, so leave it there in shmem_getpage_gfp().
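
To make the "huge=within_size" arithmetic concrete (a standalone sketch,
not part of the patch, assuming x86_64's 4kB base pages and 2MB PMD-sized
huge pages):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define HPAGE_PMD_NR	512UL
#define HPAGE_PMD_SIZE	(HPAGE_PMD_NR * PAGE_SIZE)
#define round_up(x, y)	((((x) - 1) | ((y) - 1)) + 1)

static bool within_size_huge(unsigned long index, unsigned long i_size)
{
	index = round_up(index, HPAGE_PMD_NR);
	i_size = round_up(i_size, PAGE_SIZE);
	return i_size >= HPAGE_PMD_SIZE && (i_size >> PAGE_SHIFT) >= index;
}

int main(void)
{
	/* 3MB file: an index inside the first huge extent qualifies ... */
	printf("%d\n", within_size_huge(0, 3UL << 20));	/* 1 */
	/* ... but an index at or beyond EOF does not */
	printf("%d\n", within_size_huge(768, 3UL << 20));	/* 0 */
	return 0;
}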

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/shmem_fs.h |  9 +++--
 mm/khugepaged.c          |  2 +-
 mm/shmem.c               | 84 ++++++++++++----------------------------
 3 files changed, 32 insertions(+), 63 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 9b7f7ac52351..3b05a28e34c4 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -86,7 +86,12 @@ extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
 extern int shmem_unuse(unsigned int type, bool frontswap,
 		       unsigned long *fs_pages_to_unuse);
 
-extern bool shmem_huge_enabled(struct vm_area_struct *vma);
+extern bool shmem_is_huge(struct vm_area_struct *vma,
+			  struct inode *inode, pgoff_t index);
+static inline bool shmem_huge_enabled(struct vm_area_struct *vma)
+{
+	return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff);
+}
 extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
 extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 						pgoff_t start, pgoff_t end);
@@ -95,8 +100,6 @@ extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
-	SGP_NOHUGE,	/* like SGP_CACHE, but no huge pages */
-	SGP_HUGE,	/* like SGP_CACHE, huge pages preferred */
 	SGP_WRITE,	/* may exceed i_size, may allocate !Uptodate page */
 	SGP_FALLOC,	/* like SGP_WRITE, but make existing page Uptodate */
 };
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b0412be08fa2..cecb19c3e965 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1721,7 +1721,7 @@ static void collapse_file(struct mm_struct *mm,
 				xas_unlock_irq(&xas);
 				/* swap in or instantiate fallocated page */
 				if (shmem_getpage(mapping->host, index, &page,
-						  SGP_NOHUGE)) {
+						  SGP_CACHE)) {
 					result = SCAN_FAIL;
 					goto xa_unlocked;
 				}
diff --git a/mm/shmem.c b/mm/shmem.c
index 740d48ef1eb5..6def7391084c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -474,39 +474,35 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /* ifdef here to avoid bloating shmem.o when not necessary */
 
-static int shmem_huge __read_mostly;
+static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
 
-bool shmem_huge_enabled(struct vm_area_struct *vma)
+bool shmem_is_huge(struct vm_area_struct *vma,
+		   struct inode *inode, pgoff_t index)
 {
-	struct inode *inode = file_inode(vma->vm_file);
-	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	loff_t i_size;
-	pgoff_t off;
 
-	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-		return false;
-	if (shmem_huge == SHMEM_HUGE_FORCE)
-		return true;
 	if (shmem_huge == SHMEM_HUGE_DENY)
 		return false;
-	switch (sbinfo->huge) {
-	case SHMEM_HUGE_NEVER:
+	if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
 		return false;
+	if (shmem_huge == SHMEM_HUGE_FORCE)
+		return true;
+
+	switch (SHMEM_SB(inode->i_sb)->huge) {
 	case SHMEM_HUGE_ALWAYS:
 		return true;
 	case SHMEM_HUGE_WITHIN_SIZE:
-		off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
+		index = round_up(index, HPAGE_PMD_NR);
 		i_size = round_up(i_size_read(inode), PAGE_SIZE);
-		if (i_size >= HPAGE_PMD_SIZE &&
-				i_size >> PAGE_SHIFT >= off)
+		if (i_size >= HPAGE_PMD_SIZE && (i_size >> PAGE_SHIFT) >= index)
 			return true;
 		fallthrough;
 	case SHMEM_HUGE_ADVISE:
-		/* TODO: implement fadvise() hints */
-		return (vma->vm_flags & VM_HUGEPAGE);
+		if (vma && (vma->vm_flags & VM_HUGEPAGE))
+			return true;
+		fallthrough;
 	default:
-		VM_BUG_ON(1);
 		return false;
 	}
 }
@@ -680,6 +676,12 @@ static long shmem_unused_huge_count(struct super_block *sb,
 
 #define shmem_huge SHMEM_HUGE_DENY
 
+bool shmem_is_huge(struct vm_area_struct *vma,
+		   struct inode *inode, pgoff_t index)
+{
+	return false;
+}
+
 static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
 		struct shrink_control *sc, unsigned long nr_to_split)
 {
@@ -1829,7 +1831,6 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	struct shmem_sb_info *sbinfo;
 	struct mm_struct *charge_mm;
 	struct page *page;
-	enum sgp_type sgp_huge = sgp;
 	pgoff_t hindex = index;
 	gfp_t huge_gfp;
 	int error;
@@ -1838,8 +1839,6 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 
 	if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
 		return -EFBIG;
-	if (sgp == SGP_NOHUGE || sgp == SGP_HUGE)
-		sgp = SGP_CACHE;
 repeat:
 	if (sgp <= SGP_CACHE &&
 	    ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
@@ -1898,36 +1897,12 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		return 0;
 	}
 
-	/* shmem_symlink() */
-	if (!shmem_mapping(mapping))
-		goto alloc_nohuge;
-	if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE)
+	/* Never use a huge page for shmem_symlink() */
+	if (S_ISLNK(inode->i_mode))
 		goto alloc_nohuge;
-	if (shmem_huge == SHMEM_HUGE_FORCE)
-		goto alloc_huge;
-	switch (sbinfo->huge) {
-	case SHMEM_HUGE_NEVER:
+	if (!shmem_is_huge(vma, inode, index))
 		goto alloc_nohuge;
-	case SHMEM_HUGE_WITHIN_SIZE: {
-		loff_t i_size;
-		pgoff_t off;
-
-		off = round_up(index, HPAGE_PMD_NR);
-		i_size = round_up(i_size_read(inode), PAGE_SIZE);
-		if (i_size >= HPAGE_PMD_SIZE &&
-		    i_size >> PAGE_SHIFT >= off)
-			goto alloc_huge;
 
-		fallthrough;
-	}
-	case SHMEM_HUGE_ADVISE:
-		if (sgp_huge == SGP_HUGE)
-			goto alloc_huge;
-		/* TODO: implement fadvise() hints */
-		goto alloc_nohuge;
-	}
-
-alloc_huge:
 	huge_gfp = vma_thp_gfp_mask(vma);
 	huge_gfp = limit_gfp_mask(huge_gfp, gfp);
 	page = shmem_alloc_and_acct_page(huge_gfp, inode, index, true);
@@ -2083,7 +2058,6 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = file_inode(vma->vm_file);
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
-	enum sgp_type sgp;
 	int err;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
@@ -2146,15 +2120,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 		spin_unlock(&inode->i_lock);
 	}
 
-	sgp = SGP_CACHE;
-
-	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-		sgp = SGP_NOHUGE;
-	else if (vma->vm_flags & VM_HUGEPAGE)
-		sgp = SGP_HUGE;
-
-	err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
+	err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
 				  gfp, vma, vmf, &ret);
 	if (err)
 		return vmf_error(err);
@@ -3961,7 +3927,7 @@ int __init shmem_init(void)
 	if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
 		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
 	else
-		shmem_huge = 0; /* just in case it was patched */
+		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
 #endif
 	return 0;
 
-- 
2.26.2


* [PATCH 07/16] memfd: memfd_create(name, MFD_HUGEPAGE) for shmem huge pages
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  7:45   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

Commit 749df87bd7be ("mm/shmem: add hugetlbfs support to memfd_create()")
in 4.14 added the MFD_HUGETLB flag to memfd_create(), to use hugetlbfs
pages instead of tmpfs pages: now add the MFD_HUGEPAGE flag, to use tmpfs
Transparent Huge Pages when they can be allocated (flag named to follow
the precedent of madvise's MADV_HUGEPAGE for THPs).

/sys/kernel/mm/transparent_hugepage/shmem_enabled "always" or "force"
already made this possible: but that is much too blunt an instrument,
affecting all the very different kinds of files on the internal shmem
mount, and was intended just for ease of testing hugepage loads.

MFD_HUGEPAGE is implemented internally by VM_HUGEPAGE in the shmem inode
flags: do not permit a PR_SET_THP_DISABLE (MMF_DISABLE_THP) task to set
this flag, and do not set it if THPs are not allowed at all; but let the
memfd_create() succeed even in those cases - the caller wants to create a
memfd, just hinting how it's best allocated if huge pages are available.

shmem_is_huge() (at allocation time or khugepaged time) applies its
SHMEM_HUGE_DENY and vma VM_NOHUGEPAGE and vm_mm MMF_DISABLE_THP checks
first, and only then allows the memfd's MFD_HUGEPAGE to take effect.
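
A quick behavioural sketch of the new flag checks (illustrative only, not
part of the patch; MFD_HUGEPAGE is defined locally with the value from
this patch's uapi addition, since released headers do not have it):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

#ifndef MFD_HUGEPAGE
#define MFD_HUGEPAGE 0x0008U
#endif

int main(void)
{
	/* Hint only: succeeds even when THPs are unavailable or disabled */
	int fd = memfd_create("thp-hint", MFD_CLOEXEC | MFD_HUGEPAGE);

	printf("MFD_HUGEPAGE: fd=%d errno=%d\n", fd, fd < 0 ? errno : 0);
	if (fd >= 0)
		close(fd);

	/* hugetlbfs and huge tmpfs are mutually exclusive: expect EINVAL */
	fd = memfd_create("bad-combo", MFD_HUGETLB | MFD_HUGEPAGE);
	printf("MFD_HUGETLB|MFD_HUGEPAGE: fd=%d errno=%d\n",
	       fd, fd < 0 ? errno : 0);
	return 0;
}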

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/uapi/linux/memfd.h |  3 ++-
 mm/memfd.c                 | 24 ++++++++++++++++++------
 mm/shmem.c                 | 33 +++++++++++++++++++++++++++++++--
 3 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..8358a69e78cc 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -7,7 +7,8 @@
 /* flags for memfd_create(2) (unsigned int) */
 #define MFD_CLOEXEC		0x0001U
 #define MFD_ALLOW_SEALING	0x0002U
-#define MFD_HUGETLB		0x0004U
+#define MFD_HUGETLB		0x0004U		/* Use hugetlbfs */
+#define MFD_HUGEPAGE		0x0008U		/* Use huge tmpfs */
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/memfd.c b/mm/memfd.c
index 081dd33e6a61..0d1a504d2fc9 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -245,7 +245,10 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS  (MFD_CLOEXEC | \
+			MFD_ALLOW_SEALING | \
+			MFD_HUGETLB | \
+			MFD_HUGEPAGE)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -257,14 +260,17 @@ SYSCALL_DEFINE2(memfd_create,
 	char *name;
 	long len;
 
-	if (!(flags & MFD_HUGETLB)) {
-		if (flags & ~(unsigned int)MFD_ALL_FLAGS)
+	if (flags & MFD_HUGETLB) {
+		/* Disallow huge tmpfs when choosing hugetlbfs */
+		if (flags & MFD_HUGEPAGE)
 			return -EINVAL;
-	} else {
 		/* Allow huge page size encoding in flags. */
 		if (flags & ~(unsigned int)(MFD_ALL_FLAGS |
 				(MFD_HUGE_MASK << MFD_HUGE_SHIFT)))
 			return -EINVAL;
+	} else {
+		if (flags & ~(unsigned int)MFD_ALL_FLAGS)
+			return -EINVAL;
 	}
 
 	/* length includes terminating zero */
@@ -303,8 +309,14 @@ SYSCALL_DEFINE2(memfd_create,
 					HUGETLB_ANONHUGE_INODE,
 					(flags >> MFD_HUGE_SHIFT) &
 					MFD_HUGE_MASK);
-	} else
-		file = shmem_file_setup(name, 0, VM_NORESERVE);
+	} else {
+		unsigned long vm_flags = VM_NORESERVE;
+
+		if (flags & MFD_HUGEPAGE)
+			vm_flags |= VM_HUGEPAGE;
+		file = shmem_file_setup(name, 0, vm_flags);
+	}
+
 	if (IS_ERR(file)) {
 		error = PTR_ERR(file);
 		goto err_fd;
diff --git a/mm/shmem.c b/mm/shmem.c
index 6def7391084c..e2bcf3313686 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -476,6 +476,20 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 
 static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
 
+/*
+ * Does either /sys/kernel/mm/transparent_hugepage/shmem_enabled or
+ * /sys/kernel/mm/transparent_hugepage/enabled allow transparent hugepages?
+ * (Can only return true when the machine has_transparent_hugepage() too.)
+ */
+static bool transparent_hugepage_allowed(void)
+{
+	return	shmem_huge > SHMEM_HUGE_NEVER ||
+		test_bit(TRANSPARENT_HUGEPAGE_FLAG,
+			&transparent_hugepage_flags) ||
+		test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+			&transparent_hugepage_flags);
+}
+
 bool shmem_is_huge(struct vm_area_struct *vma,
 		   struct inode *inode, pgoff_t index)
 {
@@ -486,6 +500,8 @@ bool shmem_is_huge(struct vm_area_struct *vma,
 	if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
 	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
 		return false;
+	if (SHMEM_I(inode)->flags & VM_HUGEPAGE)
+		return true;
 	if (shmem_huge == SHMEM_HUGE_FORCE)
 		return true;
 
@@ -676,6 +692,11 @@ static long shmem_unused_huge_count(struct super_block *sb,
 
 #define shmem_huge SHMEM_HUGE_DENY
 
+bool transparent_hugepage_allowed(void)
+{
+	return false;
+}
+
 bool shmem_is_huge(struct vm_area_struct *vma,
 		   struct inode *inode, pgoff_t index)
 {
@@ -2171,10 +2192,14 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 
 	if (shmem_huge != SHMEM_HUGE_FORCE) {
 		struct super_block *sb;
+		struct inode *inode;
 
 		if (file) {
 			VM_BUG_ON(file->f_op != &shmem_file_operations);
-			sb = file_inode(file)->i_sb;
+			inode = file_inode(file);
+			if (SHMEM_I(inode)->flags & VM_HUGEPAGE)
+				goto huge;
+			sb = inode->i_sb;
 		} else {
 			/*
 			 * Called directly from mm/mmap.c, or drivers/char/mem.c
@@ -2187,7 +2212,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 		if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER)
 			return addr;
 	}
-
+huge:
 	offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
 	if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
 		return addr;
@@ -2308,6 +2333,10 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 		atomic_set(&info->stop_eviction, 0);
 		info->seals = F_SEAL_SEAL;
 		info->flags = flags & VM_NORESERVE;
+		if ((flags & VM_HUGEPAGE) &&
+		    transparent_hugepage_allowed() &&
+		    !test_bit(MMF_DISABLE_THP, &current->mm->flags))
+			info->flags |= VM_HUGEPAGE;
 		INIT_LIST_HEAD(&info->shrinklist);
 		INIT_LIST_HEAD(&info->swaplist);
 		simple_xattrs_init(&info->xattrs);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 08/16] huge tmpfs: fcntl(fd, F_HUGEPAGE) and fcntl(fd, F_NOHUGEPAGE)
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  7:48   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

Add support for fcntl(fd, F_HUGEPAGE) and fcntl(fd, F_NOHUGEPAGE), to
select hugeness per file: useful to override the default hugeness of the
shmem mount, when occasionally needing to store a hugepage file in a
smallpage mount or vice versa.

These fcntls just specify whether or not to try for huge pages when
allocating to the object later: F_HUGEPAGE does not touch small pages
already allocated (though khugepaged may do so when the file is mapped
afterwards), F_NOHUGEPAGE does not split huge pages already allocated.

Why fcntl?  Because it's already in use (for sealing) on memfds; and I'm
anxious to keep this simple, just applying it to whole files: fallocate,
madvise and posix_fadvise each involve a range, which would need a new
kind of tree attached to the inode for proper support.  Any application
needing range support should be able to provide that from userspace, by
issuing the respective fcntl prior to instantiating each range.

Do not allow it when the file is open read-only (EBADF).  Do not permit
a PR_SET_THP_DISABLE (MMF_DISABLE_THP) task to interfere with the flags,
and do not let VM_HUGEPAGE be set if THPs are not allowed at all (EPERM).
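
For illustration only (not part of this patch), a sketch of the
intended call sequence; the F_HUGEPAGE/F_NOHUGEPAGE values are copied
from this series (F_LINUX_SPECIFIC_BASE being 1024), and /dev/shm is
assumed to be a tmpfs mount, with a made-up filename:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

#ifndef F_HUGEPAGE
#define F_HUGEPAGE	(1024 + 15)	/* values from this series */
#define F_NOHUGEPAGE	(1024 + 16)
#endif

int main(void)
{
	/* must be open for writing: F_HUGEPAGE on a read-only fd gives EBADF */
	int fd = open("/dev/shm/hugefile", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return 1;
	/* ask for huge pages on this one file, whatever the mount default;
	 * EPERM if this task disabled THP, or THPs are not allowed at all */
	if (fcntl(fd, F_HUGEPAGE) < 0)
		fprintf(stderr, "F_HUGEPAGE: %s\n", strerror(errno));
	return 0;
}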

Note that transparent_hugepage_allowed(), used to validate F_HUGEPAGE,
accepts (anon) transparent_hugepage_flags in addition to mount option.
This is to overcome the limitation of the "huge=advise" option, which
applies hugepage alignment (reducing ASLR) to all mappings, because
madvise(address,len,MADV_HUGEPAGE) needs address before it can be used.
So mount option "huge=never" gives a default which can be overridden by
fcntl(fd, F_HUGEPAGE) when /sys/kernel/mm/transparent_hugepage/enabled
is not "never" too.  (We could instead add a "huge=fcntl" mount option
between "never" and "advise", but I lack the enthusiasm for that.)

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 fs/fcntl.c                 |  5 +++
 include/linux/shmem_fs.h   |  8 +++++
 include/uapi/linux/fcntl.h |  9 +++++
 mm/shmem.c                 | 70 ++++++++++++++++++++++++++++++++++----
 4 files changed, 85 insertions(+), 7 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index f946bec8f1f1..9cfff87c3332 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -23,6 +23,7 @@
 #include <linux/rcupdate.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/shmem_fs.h>
 #include <linux/memfd.h>
 #include <linux/compat.h>
 #include <linux/mount.h>
@@ -434,6 +435,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_SET_FILE_RW_HINT:
 		err = fcntl_rw_hint(filp, cmd, arg);
 		break;
+	case F_HUGEPAGE:
+	case F_NOHUGEPAGE:
+		err = shmem_fcntl(filp, cmd, arg);
+		break;
 	default:
 		break;
 	}
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 3b05a28e34c4..51b75d74ce89 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -67,6 +67,14 @@ extern int shmem_zero_setup(struct vm_area_struct *);
 extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts);
+#ifdef CONFIG_TMPFS
+extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
+#else
+static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_TMPFS */
 #ifdef CONFIG_SHMEM
 extern const struct address_space_operations shmem_aops;
 static inline bool shmem_mapping(struct address_space *mapping)
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..10f82b223642 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -73,6 +73,15 @@
  */
 #define RWF_WRITE_LIFE_NOT_SET	RWH_WRITE_LIFE_NOT_SET
 
+/*
+ * Allocate hugepages when available: useful on a tmpfs which was not mounted
+ * with the "huge=always" option, as for memfds.  And, do not allocate hugepages
+ * even when available: useful to cancel the above request, or make an exception
+ * on a tmpfs mounted with "huge=always" (without splitting existing hugepages).
+ */
+#define F_HUGEPAGE		(F_LINUX_SPECIFIC_BASE + 15)
+#define F_NOHUGEPAGE		(F_LINUX_SPECIFIC_BASE + 16)
+
 /*
  * Types of directory notifications that may be requested.
  */
diff --git a/mm/shmem.c b/mm/shmem.c
index e2bcf3313686..67a4b7a4849b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -448,9 +448,9 @@ static bool shmem_confirm_swap(struct address_space *mapping,
  *	enables huge pages for the mount;
  * SHMEM_HUGE_WITHIN_SIZE:
  *	only allocate huge pages if the page will be fully within i_size,
- *	also respect fadvise()/madvise() hints;
+ *	also respect fcntl()/madvise() hints;
  * SHMEM_HUGE_ADVISE:
- *	only allocate huge pages if requested with fadvise()/madvise();
+ *	only allocate huge pages if requested with fcntl()/madvise().
  */
 
 #define SHMEM_HUGE_NEVER	0
@@ -477,13 +477,13 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
 
 /*
- * Does either /sys/kernel/mm/transparent_hugepage/shmem_enabled or
+ * Does either tmpfs mount option (or transparent_hugepage/shmem_enabled) or
  * /sys/kernel/mm/transparent_hugepage/enabled allow transparent hugepages?
  * (Can only return true when the machine has_transparent_hugepage() too.)
  */
-static bool transparent_hugepage_allowed(void)
+static bool transparent_hugepage_allowed(struct shmem_sb_info *sbinfo)
 {
-	return	shmem_huge > SHMEM_HUGE_NEVER ||
+	return	sbinfo->huge > SHMEM_HUGE_NEVER ||
 		test_bit(TRANSPARENT_HUGEPAGE_FLAG,
 			&transparent_hugepage_flags) ||
 		test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
@@ -500,6 +500,8 @@ bool shmem_is_huge(struct vm_area_struct *vma,
 	if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
 	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
 		return false;
+	if (SHMEM_I(inode)->flags & VM_NOHUGEPAGE)
+		return false;
 	if (SHMEM_I(inode)->flags & VM_HUGEPAGE)
 		return true;
 	if (shmem_huge == SHMEM_HUGE_FORCE)
@@ -692,7 +694,7 @@ static long shmem_unused_huge_count(struct super_block *sb,
 
 #define shmem_huge SHMEM_HUGE_DENY
 
-bool transparent_hugepage_allowed(void)
+bool transparent_hugepage_allowed(struct shmem_sb_info *sbinfo)
 {
 	return false;
 }
@@ -2197,6 +2199,8 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 		if (file) {
 			VM_BUG_ON(file->f_op != &shmem_file_operations);
 			inode = file_inode(file);
+			if (SHMEM_I(inode)->flags & VM_NOHUGEPAGE)
+				return addr;
 			if (SHMEM_I(inode)->flags & VM_HUGEPAGE)
 				goto huge;
 			sb = inode->i_sb;
@@ -2211,6 +2215,11 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 		}
 		if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER)
 			return addr;
+		/*
+		 * Note that SHMEM_HUGE_ADVISE has to give out huge-aligned
+		 * addresses to everyone, because madvise(,,MADV_HUGEPAGE)
+		 * needs the address-chicken on which to advise if huge-egg.
+		 */
 	}
 huge:
 	offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
@@ -2334,7 +2343,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 		info->seals = F_SEAL_SEAL;
 		info->flags = flags & VM_NORESERVE;
 		if ((flags & VM_HUGEPAGE) &&
-		    transparent_hugepage_allowed() &&
+		    transparent_hugepage_allowed(sbinfo) &&
 		    !test_bit(MMF_DISABLE_THP, &current->mm->flags))
 			info->flags |= VM_HUGEPAGE;
 		INIT_LIST_HEAD(&info->shrinklist);
@@ -2674,6 +2683,53 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+static int shmem_huge_fcntl(struct file *file, unsigned int cmd)
+{
+	struct inode *inode = file_inode(file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	if (!(file->f_mode & FMODE_WRITE))
+		return -EBADF;
+	if (test_bit(MMF_DISABLE_THP, &current->mm->flags))
+		return -EPERM;
+	if (cmd == F_HUGEPAGE &&
+	    !transparent_hugepage_allowed(SHMEM_SB(inode->i_sb)))
+		return -EPERM;
+
+	inode_lock(inode);
+	if (cmd == F_HUGEPAGE) {
+		info->flags &= ~VM_NOHUGEPAGE;
+		info->flags |= VM_HUGEPAGE;
+	} else {
+		info->flags &= ~VM_HUGEPAGE;
+		info->flags |= VM_NOHUGEPAGE;
+	}
+	inode_unlock(inode);
+	return 0;
+}
+
+long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	long error = -EINVAL;
+
+	if (file->f_op != &shmem_file_operations)
+		return error;
+
+	switch (cmd) {
+	/*
+	 * case F_ADD_SEALS:
+	 * case F_GET_SEALS:
+	 *	are handled by memfd_fcntl().
+	 */
+	case F_HUGEPAGE:
+	case F_NOHUGEPAGE:
+		error = shmem_huge_fcntl(file, cmd);
+		break;
+	}
+
+	return error;
+}
+
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 							 loff_t len)
 {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 09/16] huge tmpfs: decide stat.st_blksize by shmem_is_huge()
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  7:51   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

4.18 commit 89fdcd262fd4 ("mm: shmem: make stat.st_blksize return huge
page size if THP is on") added is_huge_enabled() to decide st_blksize:
now that hugeness can be defined per file, that too needs to be replaced
by shmem_is_huge().

Unless they have been fcntl'ed F_HUGEPAGE, this does give a different
answer (No) for small files on a "huge=within_size" mount: but that can
be considered a minor bugfix.  And a different answer (No) for unfcntl'ed
files on a "huge=advise" mount: I'm reluctant to complicate it, just to
reproduce the same debatable answer as before.
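
A quick way to observe that from userspace (illustrative fragment, not
part of this patch: assumes <sys/stat.h>, <stdio.h> and an fd open on a
tmpfs file, e.g. one fcntl'ed F_HUGEPAGE as in the previous patch):

	struct stat st;

	if (fstat(fd, &st) == 0)
		/* HPAGE_PMD_SIZE (2MB on x86_64) only when shmem_is_huge()
		 * allows huge pages for this file; PAGE_SIZE otherwise */
		printf("st_blksize = %ld\n", (long)st.st_blksize);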

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 67a4b7a4849b..f50f2ede71da 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -712,15 +712,6 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline bool is_huge_enabled(struct shmem_sb_info *sbinfo)
-{
-	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-	    (shmem_huge == SHMEM_HUGE_FORCE || sbinfo->huge) &&
-	    shmem_huge != SHMEM_HUGE_DENY)
-		return true;
-	return false;
-}
-
 /*
  * Like add_to_page_cache_locked, but error if expected item has gone.
  */
@@ -1101,7 +1092,6 @@ static int shmem_getattr(struct user_namespace *mnt_userns,
 {
 	struct inode *inode = path->dentry->d_inode;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct shmem_sb_info *sb_info = SHMEM_SB(inode->i_sb);
 
 	if (info->alloced - info->swapped != inode->i_mapping->nrpages) {
 		spin_lock_irq(&info->lock);
@@ -1110,7 +1100,7 @@ static int shmem_getattr(struct user_namespace *mnt_userns,
 	}
 	generic_fillattr(&init_user_ns, inode, stat);
 
-	if (is_huge_enabled(sb_info))
+	if (shmem_is_huge(NULL, inode, 0))
 		stat->blksize = HPAGE_PMD_SIZE;
 
 	return 0;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 10/16] tmpfs: fcntl(fd, F_MEM_LOCK) to memlock a tmpfs file
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  7:55   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

From: Shakeel Butt <shakeelb@google.com>

A new uapi to lock tmpfs files in memory, protecting them from swap
without having to map them. This commit introduces two new commands to
fcntl and shmem: F_MEM_LOCK and F_MEM_UNLOCK. The locked size is
charged against the RLIMIT_MEMLOCK of the caller's uid, in the caller's
user namespace.

This feature is implemented mostly by reusing shmctl's SHM_LOCK
mechanism (System V IPC shared memory). The api follows the design
choices of shmctl's SHM_LOCK and of the mlock2 syscall: pages currently
on swap are not populated at the time of the call, but are brought
into memory on first access.

As with System V shared memory, these pages are counted as Unevictable
in /proc/meminfo (when they are allocated, or when page reclaim finds
any allocated earlier), but they are not counted as Mlocked there.

To keep the user accounting simple, locked files are forbidden to grow
or shrink. This design decision can be revisited once such a use case
arises.

The permissions to lock and unlock differ slightly from other similar
interfaces. Anyone with CAP_IPC_LOCK, or with remaining rlimit, can
lock the file; but the unlocker must either have CAP_IPC_LOCK or be
the locker itself.

This commit does not make the locked status of a tmpfs file visible.
We can add an F_MEM_LOCKED fcntl later, to query that status if
required; but it's not yet clear how best to make it visible.
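
For illustration only (not part of this patch), a sketch of the
intended usage; the F_MEM_LOCK/F_MEM_UNLOCK values are copied from this
series, and /dev/shm is assumed to be a tmpfs mount, with a made-up
filename:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

#ifndef F_MEM_LOCK
#define F_MEM_LOCK	(1024 + 17)	/* values from this patch */
#define F_MEM_UNLOCK	(1024 + 18)
#endif

int main(void)
{
	int fd = open("/dev/shm/noswap", O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return 1;
	/* size the file first: a locked file may neither grow nor shrink */
	if (ftruncate(fd, 1 << 20) < 0)
		return 1;
	if (fcntl(fd, F_MEM_LOCK) < 0) {
		/* EPERM when RLIMIT_MEMLOCK is 0 without CAP_IPC_LOCK,
		 * ENOMEM when the rlimit would be exceeded */
		fprintf(stderr, "F_MEM_LOCK: %s\n", strerror(errno));
		return 1;
	}
	/* ... use the file; its pages are kept out of swap ... */
	fcntl(fd, F_MEM_UNLOCK);
	return 0;
}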

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 fs/fcntl.c                 |  2 ++
 include/linux/shmem_fs.h   |  1 +
 include/uapi/linux/fcntl.h |  7 +++++
 mm/shmem.c                 | 59 ++++++++++++++++++++++++++++++++++++--
 4 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 9cfff87c3332..a3534764b50e 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -437,6 +437,8 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		break;
 	case F_HUGEPAGE:
 	case F_NOHUGEPAGE:
+	case F_MEM_LOCK:
+	case F_MEM_UNLOCK:
 		err = shmem_fcntl(filp, cmd, arg);
 		break;
 	default:
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 51b75d74ce89..ffdd0da816e5 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -24,6 +24,7 @@ struct shmem_inode_info {
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
 	atomic_t		stop_eviction;	/* hold when working on inode */
+	struct ucounts		*mlock_ucounts;	/* user memlocked tmpfs file */
 	struct inode		vfs_inode;
 };
 
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 10f82b223642..21dc969df0fd 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -82,6 +82,13 @@
 #define F_HUGEPAGE		(F_LINUX_SPECIFIC_BASE + 15)
 #define F_NOHUGEPAGE		(F_LINUX_SPECIFIC_BASE + 16)
 
+/*
+ * Lock all pages of file into memory, as they are allocated; or unlock them.
+ * Currently supported only on tmpfs, and on its memfd_created files.
+ */
+#define F_MEM_LOCK		(F_LINUX_SPECIFIC_BASE + 17)
+#define F_MEM_UNLOCK		(F_LINUX_SPECIFIC_BASE + 18)
+
 /*
  * Types of directory notifications that may be requested.
  */
diff --git a/mm/shmem.c b/mm/shmem.c
index f50f2ede71da..ba9b9900287b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -888,7 +888,7 @@ unsigned long shmem_swap_usage(struct vm_area_struct *vma)
 }
 
 /*
- * SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
+ * SHM_UNLOCK or F_MEM_UNLOCK restore Unevictable pages to their evictable list.
  */
 void shmem_unlock_mapping(struct address_space *mapping)
 {
@@ -897,7 +897,7 @@ void shmem_unlock_mapping(struct address_space *mapping)
 
 	pagevec_init(&pvec);
 	/*
-	 * Minor point, but we might as well stop if someone else SHM_LOCKs it.
+	 * Minor point, but we might as well stop if someone else memlocks it.
 	 */
 	while (!mapping_unevictable(mapping)) {
 		if (!pagevec_lookup(&pvec, mapping, &index))
@@ -1123,7 +1123,8 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 
 		/* protected by i_mutex */
 		if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
-		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
+		    (newsize > oldsize && (info->seals & F_SEAL_GROW)) ||
+		    (newsize != oldsize && info->mlock_ucounts))
 			return -EPERM;
 
 		if (newsize != oldsize) {
@@ -1161,6 +1162,10 @@ static void shmem_evict_inode(struct inode *inode)
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 
 	if (shmem_mapping(inode->i_mapping)) {
+		if (info->mlock_ucounts) {
+			user_shm_unlock(inode->i_size, info->mlock_ucounts);
+			info->mlock_ucounts = NULL;
+		}
 		shmem_unacct_size(info->flags, inode->i_size);
 		inode->i_size = 0;
 		shmem_truncate_range(inode, 0, (loff_t)-1);
@@ -2266,6 +2271,7 @@ int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
 
 	/*
 	 * What serializes the accesses to info->flags?
+	 * inode_lock() when called from shmem_memlock_fcntl(),
 	 * ipc_lock_object() when called from shmctl_do_lock(),
 	 * no serialization needed when called from shm_destroy().
 	 */
@@ -2286,6 +2292,43 @@ int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
 	return retval;
 }
 
+static int shmem_memlock_fcntl(struct file *file, unsigned int cmd)
+{
+	struct inode *inode = file_inode(file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	bool cleanup_mapping = false;
+	int retval = 0;
+
+	inode_lock(inode);
+	if (cmd == F_MEM_LOCK) {
+		if (!info->mlock_ucounts) {
+			struct ucounts *ucounts = current_ucounts();
+			/* capability/rlimit check is down in user_shm_lock */
+			retval = shmem_lock(file, 1, ucounts);
+			if (!retval)
+				info->mlock_ucounts = ucounts;
+			else if (!rlimit(RLIMIT_MEMLOCK))
+				retval = -EPERM;
+			/* else retval == -ENOMEM */
+		}
+	} else { /* F_MEM_UNLOCK */
+		if (info->mlock_ucounts) {
+			if (info->mlock_ucounts == current_ucounts() ||
+			    capable(CAP_IPC_LOCK)) {
+				shmem_lock(file, 0, info->mlock_ucounts);
+				info->mlock_ucounts = NULL;
+				cleanup_mapping = true;
+			} else
+				retval = -EPERM;
+		}
+	}
+	inode_unlock(inode);
+
+	if (cleanup_mapping)
+		shmem_unlock_mapping(file->f_mapping);
+	return retval;
+}
+
 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct shmem_inode_info *info = SHMEM_I(file_inode(file));
@@ -2503,6 +2546,8 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 		if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
 			return -EPERM;
 	}
+	if (unlikely(info->mlock_ucounts) && pos + len > inode->i_size)
+		return -EPERM;
 
 	return shmem_getpage(inode, index, pagep, SGP_WRITE);
 }
@@ -2715,6 +2760,10 @@ long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 	case F_NOHUGEPAGE:
 		error = shmem_huge_fcntl(file, cmd);
 		break;
+	case F_MEM_LOCK:
+	case F_MEM_UNLOCK:
+		error = shmem_memlock_fcntl(file, cmd);
+		break;
 	}
 
 	return error;
@@ -2778,6 +2827,10 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		error = -EPERM;
 		goto out;
 	}
+	if (info->mlock_ucounts && offset + len > inode->i_size) {
+		error = -EPERM;
+		goto out;
+	}
 
 	start = offset >> PAGE_SHIFT;
 	end = (offset + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 11/16] tmpfs: fcntl(fd, F_MEM_LOCKED) to test if memlocked
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  7:57   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  7:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

Though we have not yet found a compelling need to make the locked status
of a tmpfs file visible, and offer no tool to show it, the kernel ought
to be able to support such a tool: add the F_MEM_LOCKED fcntl, returning
-1 on failure (not tmpfs), 0 when not F_MEM_LOCKED, 1 when F_MEM_LOCKED.
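
A sketch of how such a tool might query it (illustrative fragment, not
part of this patch: assumes <fcntl.h>, <stdio.h> and an open fd, with
the value copied from this series):

	#ifndef F_MEM_LOCKED
	#define F_MEM_LOCKED	(1024 + 19)	/* value from this patch */
	#endif

	int locked = fcntl(fd, F_MEM_LOCKED);

	if (locked < 0)		/* -1 with EINVAL: not a tmpfs file */
		perror("F_MEM_LOCKED");
	else
		printf("memlocked: %s\n", locked ? "yes" : "no");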

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 fs/fcntl.c                 | 1 +
 include/uapi/linux/fcntl.h | 1 +
 mm/shmem.c                 | 4 ++++
 3 files changed, 6 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index a3534764b50e..0d8dc723732d 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -439,6 +439,7 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_NOHUGEPAGE:
 	case F_MEM_LOCK:
 	case F_MEM_UNLOCK:
+	case F_MEM_LOCKED:
 		err = shmem_fcntl(filp, cmd, arg);
 		break;
 	default:
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 21dc969df0fd..012585e8c9ab 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -88,6 +88,7 @@
  */
 #define F_MEM_LOCK		(F_LINUX_SPECIFIC_BASE + 17)
 #define F_MEM_UNLOCK		(F_LINUX_SPECIFIC_BASE + 18)
+#define F_MEM_LOCKED		(F_LINUX_SPECIFIC_BASE + 19)
 
 /*
  * Types of directory notifications that may be requested.
diff --git a/mm/shmem.c b/mm/shmem.c
index ba9b9900287b..6e53dabe658b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2299,6 +2299,9 @@ static int shmem_memlock_fcntl(struct file *file, unsigned int cmd)
 	bool cleanup_mapping = false;
 	int retval = 0;
 
+	if (cmd == F_MEM_LOCKED)
+		return !!info->mlock_ucounts;
+
 	inode_lock(inode);
 	if (cmd == F_MEM_LOCK) {
 		if (!info->mlock_ucounts) {
@@ -2762,6 +2765,7 @@ long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 		break;
 	case F_MEM_LOCK:
 	case F_MEM_UNLOCK:
+	case F_MEM_LOCKED:
 		error = shmem_memlock_fcntl(file, cmd);
 		break;
 	}
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 12/16] tmpfs: refuse memlock when fallocated beyond i_size
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  8:00   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  8:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

F_MEM_LOCK is accounted by i_size, but fallocate(,FALLOC_FL_KEEP_SIZE,,)
could have added many pages beyond i_size, which would also be held as
Unevictable from memory. The mlock_ucounts check in shmem_fallocate() is
fine, but shmem_memlock_fcntl() needs to check fallocend too. We could
change F_MEM_LOCK accounting to use the max of i_size and fallocend, but
fallocend is obscure: I think it's better just to refuse the F_MEM_LOCK
(with EPERM) if fallocend exceeds (page-rounded) i_size.
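
For example (illustrative fragment, not part of this patch: assumes
_GNU_SOURCE with <fcntl.h>, <stdio.h> and <errno.h>, an fd open on an
empty tmpfs file, and the F_MEM_LOCK value from the earlier patch):

	/* instantiate 8MB of pages beyond i_size, leaving i_size at 0 */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 8 << 20) == 0) {
		/* now refused: locking is accounted by i_size, and those
		 * fallocated pages would escape that accounting */
		if (fcntl(fd, F_MEM_LOCK) < 0 && errno == EPERM)
			fprintf(stderr, "F_MEM_LOCK refused beyond i_size\n");
	}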

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 6e53dabe658b..35c0f5c7120e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2304,7 +2304,10 @@ static int shmem_memlock_fcntl(struct file *file, unsigned int cmd)
 
 	inode_lock(inode);
 	if (cmd == F_MEM_LOCK) {
-		if (!info->mlock_ucounts) {
+		if (info->fallocend > DIV_ROUND_UP(inode->i_size, PAGE_SIZE)) {
+			/* locking is accounted by i_size: disallow excess */
+			retval = -EPERM;
+		} else if (!info->mlock_ucounts) {
 			struct ucounts *ucounts = current_ucounts();
 			/* capability/rlimit check is down in user_shm_lock */
 			retval = shmem_lock(file, 1, ucounts);
@@ -2854,9 +2857,10 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	spin_unlock(&inode->i_lock);
 
 	/*
-	 * info->fallocend is only relevant when huge pages might be
+	 * info->fallocend is mostly relevant when huge pages might be
 	 * involved: to prevent split_huge_page() freeing fallocated
 	 * pages when FALLOC_FL_KEEP_SIZE committed beyond i_size.
+	 * But it is also checked in F_MEM_LOCK validation.
 	 */
 	undo_fallocend = info->fallocend;
 	if (info->fallocend < end)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 13/16] mm: bool user_shm_lock(loff_t size, struct ucounts *)
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  8:03   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  8:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

user_shm_lock()'s size_t size was big enough for SysV SHM locking, but
not quite big enough for O_LARGEFILE on 32-bit: change to loff_t size.
And while changing the prototype, let's use bool rather than int here.
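
A rough illustration of the truncation avoided, using an arbitrary 6GB
sparse file as the example (sketch only, userspace arithmetic rather than
kernel code):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint64_t size = 6ULL << 30;        /* 6GB: needs O_LARGEFILE on 32-bit */
          uint32_t size32 = (uint32_t)size;  /* a 32-bit size_t keeps only 2GB */

          printf("pages accounted with size_t: %llu\n",
                 (unsigned long long)((size32 + 4095ULL) >> 12));  /* 524288 */
          printf("pages accounted with loff_t: %llu\n",
                 (unsigned long long)((size + 4095) >> 12));       /* 1572864 */
          return 0;
  }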

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm.h |  4 ++--
 mm/mlock.c         | 14 +++++++-------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..f1be2221512b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1713,8 +1713,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern int user_shm_lock(size_t, struct ucounts *);
-extern void user_shm_unlock(size_t, struct ucounts *);
+extern bool user_shm_lock(loff_t size, struct ucounts *ucounts);
+extern void user_shm_unlock(loff_t size, struct ucounts *ucounts);
 
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/mm/mlock.c b/mm/mlock.c
index 16d2ee160d43..7df88fce0fc9 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -813,21 +813,21 @@ SYSCALL_DEFINE0(munlockall)
 }
 
 /*
- * Objects with different lifetime than processes (SHM_LOCK and SHM_HUGETLB
- * shm segments) get accounted against the user_struct instead.
+ * Objects with different lifetime than processes (SHM_LOCK and SHM_HUGETLB shm
+ * segments and F_MEM_LOCK tmpfs) get accounted to the user_namespace instead.
  */
 static DEFINE_SPINLOCK(shmlock_user_lock);
 
-int user_shm_lock(size_t size, struct ucounts *ucounts)
+bool user_shm_lock(loff_t size, struct ucounts *ucounts)
 {
 	unsigned long lock_limit, locked;
 	long memlock;
-	int allowed = 0;
+	bool allowed = false;
 
 	locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
 	if (lock_limit == RLIM_INFINITY)
-		allowed = 1;
+		allowed = true;
 	lock_limit >>= PAGE_SHIFT;
 	spin_lock(&shmlock_user_lock);
 	memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
@@ -840,13 +840,13 @@ int user_shm_lock(size_t size, struct ucounts *ucounts)
 		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
 		goto out;
 	}
-	allowed = 1;
+	allowed = true;
 out:
 	spin_unlock(&shmlock_user_lock);
 	return allowed;
 }
 
-void user_shm_unlock(size_t size, struct ucounts *ucounts)
+void user_shm_unlock(loff_t size, struct ucounts *ucounts)
 {
 	spin_lock(&shmlock_user_lock);
 	dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, (size + PAGE_SIZE - 1) >> PAGE_SHIFT);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 14/16] mm: user_shm_lock(,,getuc) and user_shm_unlock(,,putuc)
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  8:06   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  8:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

user_shm_lock() and user_shm_unlock() have to get and put a reference on
the ucounts structure, and the get fails on refcount overflow.  That will
be awkward for the next commit (shrinking ought not to fail), so add an
argument (always true in this commit) to condition that get and put.  It
would be even easier to do the put_ucounts() separately when unlocking,
but messy for the get_ucounts() when locking: better to keep them
symmetric.
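
To make the intended use concrete, a sketch of the calling pattern expected
from the next commit (hypothetical helpers, not code from this series):

  /* The ucounts reference was taken by the original lock, so a later
   * grow or shrink only adjusts the accounting: shrink cannot fail. */
  static int grow_locked_extent(struct shmem_inode_info *info, loff_t extra)
  {
          if (!user_shm_lock(extra, info->mlock_ucounts, false))
                  return -EPERM;
          return 0;
  }

  static void shrink_locked_extent(struct shmem_inode_info *info, loff_t less)
  {
          user_shm_unlock(less, info->mlock_ucounts, false);
  }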

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 fs/hugetlbfs/inode.c | 4 ++--
 include/linux/mm.h   | 4 ++--
 ipc/shm.c            | 4 ++--
 mm/mlock.c           | 9 +++++----
 mm/shmem.c           | 6 +++---
 5 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cdfb1ae78a3f..381902288f4d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1465,7 +1465,7 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
 
 	if (creat_flags == HUGETLB_SHMFS_INODE && !can_do_hugetlb_shm()) {
 		*ucounts = current_ucounts();
-		if (user_shm_lock(size, *ucounts)) {
+		if (user_shm_lock(size, *ucounts, true)) {
 			task_lock(current);
 			pr_warn_once("%s (%d): Using mlock ulimits for SHM_HUGETLB is deprecated\n",
 				current->comm, current->pid);
@@ -1499,7 +1499,7 @@ struct file *hugetlb_file_setup(const char *name, size_t size,
 	iput(inode);
 out:
 	if (*ucounts) {
-		user_shm_unlock(size, *ucounts);
+		user_shm_unlock(size, *ucounts, true);
 		*ucounts = NULL;
 	}
 	return file;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f1be2221512b..43cb5a6f97ff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1713,8 +1713,8 @@ extern bool can_do_mlock(void);
 #else
 static inline bool can_do_mlock(void) { return false; }
 #endif
-extern bool user_shm_lock(loff_t size, struct ucounts *ucounts);
-extern void user_shm_unlock(loff_t size, struct ucounts *ucounts);
+extern bool user_shm_lock(loff_t size, struct ucounts *ucounts, bool getuc);
+extern void user_shm_unlock(loff_t size, struct ucounts *ucounts, bool putuc);
 
 /*
  * Parameter block passed down to zap_pte_range in exceptional cases.
diff --git a/ipc/shm.c b/ipc/shm.c
index 748933e376ca..3e63809d38b7 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -289,7 +289,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
 		shmem_lock(shm_file, 0, shp->mlock_ucounts);
 	else if (shp->mlock_ucounts)
 		user_shm_unlock(i_size_read(file_inode(shm_file)),
-				shp->mlock_ucounts);
+				shp->mlock_ucounts, true);
 	fput(shm_file);
 	ipc_update_pid(&shp->shm_cprid, NULL);
 	ipc_update_pid(&shp->shm_lprid, NULL);
@@ -699,7 +699,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 	ipc_update_pid(&shp->shm_cprid, NULL);
 	ipc_update_pid(&shp->shm_lprid, NULL);
 	if (is_file_hugepages(file) && shp->mlock_ucounts)
-		user_shm_unlock(size, shp->mlock_ucounts);
+		user_shm_unlock(size, shp->mlock_ucounts, true);
 	fput(file);
 	ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
 	return error;
diff --git a/mm/mlock.c b/mm/mlock.c
index 7df88fce0fc9..5afa3eba9a13 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -818,7 +818,7 @@ SYSCALL_DEFINE0(munlockall)
  */
 static DEFINE_SPINLOCK(shmlock_user_lock);
 
-bool user_shm_lock(loff_t size, struct ucounts *ucounts)
+bool user_shm_lock(loff_t size, struct ucounts *ucounts, bool getuc)
 {
 	unsigned long lock_limit, locked;
 	long memlock;
@@ -836,7 +836,7 @@ bool user_shm_lock(loff_t size, struct ucounts *ucounts)
 		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
 		goto out;
 	}
-	if (!get_ucounts(ucounts)) {
+	if (getuc && !get_ucounts(ucounts)) {
 		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
 		goto out;
 	}
@@ -846,10 +846,11 @@ bool user_shm_lock(loff_t size, struct ucounts *ucounts)
 	return allowed;
 }
 
-void user_shm_unlock(loff_t size, struct ucounts *ucounts)
+void user_shm_unlock(loff_t size, struct ucounts *ucounts, bool putuc)
 {
 	spin_lock(&shmlock_user_lock);
 	dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, (size + PAGE_SIZE - 1) >> PAGE_SHIFT);
 	spin_unlock(&shmlock_user_lock);
-	put_ucounts(ucounts);
+	if (putuc)
+		put_ucounts(ucounts);
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 35c0f5c7120e..1ddb910e976c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1163,7 +1163,7 @@ static void shmem_evict_inode(struct inode *inode)
 
 	if (shmem_mapping(inode->i_mapping)) {
 		if (info->mlock_ucounts) {
-			user_shm_unlock(inode->i_size, info->mlock_ucounts);
+			user_shm_unlock(inode->i_size, info->mlock_ucounts, true);
 			info->mlock_ucounts = NULL;
 		}
 		shmem_unacct_size(info->flags, inode->i_size);
@@ -2276,13 +2276,13 @@ int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
 	 * no serialization needed when called from shm_destroy().
 	 */
 	if (lock && !(info->flags & VM_LOCKED)) {
-		if (!user_shm_lock(inode->i_size, ucounts))
+		if (!user_shm_lock(inode->i_size, ucounts, true))
 			goto out_nomem;
 		info->flags |= VM_LOCKED;
 		mapping_set_unevictable(file->f_mapping);
 	}
 	if (!lock && (info->flags & VM_LOCKED) && ucounts) {
-		user_shm_unlock(inode->i_size, ucounts);
+		user_shm_unlock(inode->i_size, ucounts, true);
 		info->flags &= ~VM_LOCKED;
 		mapping_clear_unevictable(file->f_mapping);
 	}
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 15/16] tmpfs: permit changing size of memlocked file
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  8:09   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  8:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

We have users who change the size of their memlocked file by F_MEM_UNLOCK,
ftruncate, F_MEM_LOCK.  That risks swapout in between, and is distasteful:
particularly if the file is very large (when shmem_unlock_mapping() has a
lot of work to move pages off the Unevictable list, only for them to be
moved back there later on).

Modify shmem_setattr() to grow or shrink, and shmem_fallocate() to grow,
the locked extent.  But forbid (EPERM) both if current_ucounts() differs
from the locker's mlock_ucounts (without even a CAP_IPC_LOCK override).
They could be permitted (the caller already has unsealed write access),
but it's probably less confusing to restrict size change to the locker.

But leave shmem_write_begin() as is, preventing the memlocked file from
being extended implicitly by writes beyond EOF: I think that it's best to
demand an explicit size change, by truncate or fallocate, when memlocked.

(But notice in testing "echo x >memlockedfile" how the O_TRUNC succeeds
but the write fails: would F_MEM_UNLOCK on truncation to 0 be better?)
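
For illustration only, the usage this patch enables (F_MEM_LOCK's value is
copied from this series' <linux/fcntl.h>; error handling omitted):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #ifndef F_MEM_LOCK
  #define F_MEM_LOCK (1024 + 17)  /* F_LINUX_SPECIFIC_BASE + 17, as in this series */
  #endif

  int main(void)
  {
          int fd = memfd_create("locked", 0);

          ftruncate(fd, 1 << 20);         /* 1MB */
          fcntl(fd, F_MEM_LOCK);          /* memlock the whole file */

          /* Previously EPERM; now the locked extent grows with the file,
           * without any unlock/truncate/relock window for swapout. */
          return ftruncate(fd, 2 << 20) ? 1 : 0;
  }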

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/shmem.c | 48 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 10 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 1ddb910e976c..fa4a264453bf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1123,15 +1123,30 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
 
 		/* protected by i_mutex */
 		if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
-		    (newsize > oldsize && (info->seals & F_SEAL_GROW)) ||
-		    (newsize != oldsize && info->mlock_ucounts))
+		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
 			return -EPERM;
 
 		if (newsize != oldsize) {
-			error = shmem_reacct_size(SHMEM_I(inode)->flags,
-					oldsize, newsize);
+			struct ucounts *ucounts = info->mlock_ucounts;
+
+			if (ucounts && ucounts != current_ucounts())
+				return -EPERM;
+			error = shmem_reacct_size(info->flags,
+						  oldsize, newsize);
 			if (error)
 				return error;
+			if (ucounts) {
+				loff_t mlock = round_up(newsize, PAGE_SIZE) -
+						round_up(oldsize, PAGE_SIZE);
+				if (mlock < 0) {
+					user_shm_unlock(-mlock, ucounts, false);
+				} else if (mlock > 0 &&
+					!user_shm_lock(mlock, ucounts, false)) {
+					shmem_reacct_size(info->flags,
+							  newsize, oldsize);
+					return -EPERM;
+				}
+			}
 			i_size_write(inode, newsize);
 			inode->i_ctime = inode->i_mtime = current_time(inode);
 		}
@@ -2784,6 +2799,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_falloc shmem_falloc;
 	pgoff_t start, index, end, undo_fallocend;
+	loff_t mlock = 0;
 	int error;
 
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
@@ -2830,13 +2846,23 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	if (error)
 		goto out;
 
-	if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) {
-		error = -EPERM;
-		goto out;
-	}
-	if (info->mlock_ucounts && offset + len > inode->i_size) {
+	if (offset + len > inode->i_size) {
 		error = -EPERM;
-		goto out;
+		if (info->seals & F_SEAL_GROW)
+			goto out;
+		if (info->mlock_ucounts) {
+			if (info->mlock_ucounts != current_ucounts() ||
+			    (mode & FALLOC_FL_KEEP_SIZE))
+				goto out;
+			mlock = round_up(offset + len, PAGE_SIZE) -
+				round_up(inode->i_size, PAGE_SIZE);
+			if (mlock > 0 &&
+			    !user_shm_lock(mlock, info->mlock_ucounts, false)) {
+				mlock = 0;
+				goto out;
+			}
+		}
+		error = 0;
 	}
 
 	start = offset >> PAGE_SHIFT;
@@ -2932,6 +2958,8 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	inode->i_private = NULL;
 	spin_unlock(&inode->i_lock);
 out:
+	if (error && mlock > 0)
+		user_shm_unlock(mlock, info->mlock_ucounts, false);
 	inode_unlock(inode);
 	return error;
 }
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH 16/16] memfd: memfd_create(name, MFD_MEM_LOCK) for memlocked shmem
  2021-07-30  7:22 ` Hugh Dickins
@ 2021-07-30  8:13   ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-07-30  8:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

Now that the size of a memlocked file can be changed, memfd_create() can
accept an MFD_MEM_LOCK flag to request memlocking, even though the initial
size is of course 0.
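
For illustration only, a sketch of the new flag in use (MFD_MEM_LOCK's value
is copied from this series' <linux/memfd.h>; error handling omitted):

  #define _GNU_SOURCE
  #include <sys/mman.h>
  #include <unistd.h>

  #ifndef MFD_MEM_LOCK
  #define MFD_MEM_LOCK 0x0010U    /* as defined by this series */
  #endif

  int main(void)
  {
          /* created empty and already memlocked ... */
          int fd = memfd_create("locked", MFD_CLOEXEC | MFD_MEM_LOCK);

          if (fd < 0)
                  return 1;
          /* ... then given a size: the locked extent grows with it */
          return ftruncate(fd, 8 << 20) ? 1 : 0;
  }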

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/uapi/linux/memfd.h |  1 +
 mm/memfd.c                 |  7 +++++--
 mm/shmem.c                 | 13 ++++++++++++-
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 8358a69e78cc..9113b5aa1763 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -9,6 +9,7 @@
 #define MFD_ALLOW_SEALING	0x0002U
 #define MFD_HUGETLB		0x0004U		/* Use hugetlbfs */
 #define MFD_HUGEPAGE		0x0008U		/* Use huge tmpfs */
+#define MFD_MEM_LOCK		0x0010U		/* Memlock tmpfs */
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/memfd.c b/mm/memfd.c
index 0d1a504d2fc9..e39f9eed55d2 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -248,7 +248,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
 #define MFD_ALL_FLAGS  (MFD_CLOEXEC | \
 			MFD_ALLOW_SEALING | \
 			MFD_HUGETLB | \
-			MFD_HUGEPAGE)
+			MFD_HUGEPAGE | \
+			MFD_MEM_LOCK)
 
 SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
@@ -262,7 +263,7 @@ SYSCALL_DEFINE2(memfd_create,
 
 	if (flags & MFD_HUGETLB) {
 		/* Disallow huge tmpfs when choosing hugetlbfs */
-		if (flags & MFD_HUGEPAGE)
+		if (flags & (MFD_HUGEPAGE | MFD_MEM_LOCK))
 			return -EINVAL;
 		/* Allow huge page size encoding in flags. */
 		if (flags & ~(unsigned int)(MFD_ALL_FLAGS |
@@ -314,6 +315,8 @@ SYSCALL_DEFINE2(memfd_create,
 
 		if (flags & MFD_HUGEPAGE)
 			vm_flags |= VM_HUGEPAGE;
+		if (flags & MFD_MEM_LOCK)
+			vm_flags |= VM_LOCKED;
 		file = shmem_file_setup(name, 0, vm_flags);
 	}
 
diff --git a/mm/shmem.c b/mm/shmem.c
index fa4a264453bf..a0a83e59ae07 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2395,7 +2395,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 		spin_lock_init(&info->lock);
 		atomic_set(&info->stop_eviction, 0);
 		info->seals = F_SEAL_SEAL;
-		info->flags = flags & VM_NORESERVE;
+		info->flags = flags & (VM_NORESERVE | VM_LOCKED);
 		if ((flags & VM_HUGEPAGE) &&
 		    transparent_hugepage_allowed(sbinfo) &&
 		    !test_bit(MMF_DISABLE_THP, &current->mm->flags))
@@ -4254,6 +4254,17 @@ static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, l
 	inode->i_size = size;
 	clear_nlink(inode);	/* It is unlinked */
 	res = ERR_PTR(ramfs_nommu_expand_for_mapping(inode, size));
+	if (!IS_ERR(res) && (flags & VM_LOCKED)) {
+		struct ucounts *ucounts = current_ucounts();
+		/*
+		 * Only memfd_create() may pass VM_LOCKED, and it passes
+		 * size 0; but avoid that assumption in case it changes.
+		 */
+		if (user_shm_lock(size, ucounts, true))
+			SHMEM_I(inode)->mlock_ucounts = ucounts;
+		else
+			res = ERR_PTR(-EPERM);
+	}
 	if (!IS_ERR(res))
 		res = alloc_file_pseudo(inode, mnt, name, O_RDWR,
 				&shmem_file_operations);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH 16/16] memfd: memfd_create(name, MFD_MEM_LOCK) for memlocked shmem
  2021-07-30  8:13   ` Hugh Dickins
@ 2021-07-30 11:24     ` kernel test robot
  -1 siblings, 0 replies; 91+ messages in thread
From: kernel test robot @ 2021-07-30 11:24 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: kbuild-all, Linux Memory Management List, Hugh Dickins,
	Shakeel Butt, Kirill A. Shutemov, Yang Shi, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 1768 bytes --]

Hi Hugh,

I love your patch! Yet something to improve:

[auto build test ERROR on hch-configfs/for-next]
[also build test ERROR on linus/master v5.14-rc3 next-20210729]
[cannot apply to hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Hugh-Dickins/tmpfs-HUGEPAGE-and-MEM_LOCK-fcntls-and-memfds/20210730-161413
base:   git://git.infradead.org/users/hch/configfs.git for-next
config: riscv-randconfig-r022-20210730 (attached as .config)
compiler: riscv64-linux-gcc (GCC) 10.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/cad243cc46d563d62f2e0a20faa6571f7c6a692e
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Hugh-Dickins/tmpfs-HUGEPAGE-and-MEM_LOCK-fcntls-and-memfds/20210730-161413
        git checkout cad243cc46d563d62f2e0a20faa6571f7c6a692e
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-10.3.0 make.cross O=build_dir ARCH=riscv SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   riscv64-linux-ld: mm/shmem.o: in function `.L14':
>> shmem.c:(.text+0xec): undefined reference to `user_shm_lock'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34992 bytes --]

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 07/16] memfd: memfd_create(name, MFD_HUGEPAGE) for shmem huge pages
  2021-07-30  7:45   ` Hugh Dickins
@ 2021-07-30 12:01     ` kernel test robot
  -1 siblings, 0 replies; 91+ messages in thread
From: kernel test robot @ 2021-07-30 12:01 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: kbuild-all, Linux Memory Management List, Hugh Dickins,
	Shakeel Butt, Kirill A. Shutemov, Yang Shi, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 2844 bytes --]

Hi Hugh,

I love your patch! Perhaps something to improve:

[auto build test WARNING on hch-configfs/for-next]
[also build test WARNING on linus/master v5.14-rc3 next-20210729]
[cannot apply to hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Hugh-Dickins/tmpfs-HUGEPAGE-and-MEM_LOCK-fcntls-and-memfds/20210730-161413
base:   git://git.infradead.org/users/hch/configfs.git for-next
config: ia64-randconfig-r016-20210730 (attached as .config)
compiler: ia64-linux-gcc (GCC) 10.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/7d2aab7d716c95551b2bcab51ca8ff33c2d1dd58
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Hugh-Dickins/tmpfs-HUGEPAGE-and-MEM_LOCK-fcntls-and-memfds/20210730-161413
        git checkout 7d2aab7d716c95551b2bcab51ca8ff33c2d1dd58
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-10.3.0 make.cross ARCH=ia64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from arch/ia64/include/asm/pgtable.h:153,
                    from include/linux/pgtable.h:6,
                    from arch/ia64/include/asm/uaccess.h:40,
                    from include/linux/uaccess.h:11,
                    from include/linux/sched/task.h:11,
                    from include/linux/sched/signal.h:9,
                    from include/linux/rcuwait.h:6,
                    from include/linux/percpu-rwsem.h:7,
                    from include/linux/fs.h:33,
                    from mm/shmem.c:24:
   arch/ia64/include/asm/mmu_context.h: In function 'reload_context':
   arch/ia64/include/asm/mmu_context.h:127:41: warning: variable 'old_rr4' set but not used [-Wunused-but-set-variable]
     127 |  unsigned long rr0, rr1, rr2, rr3, rr4, old_rr4;
         |                                         ^~~~~~~
   mm/shmem.c: At top level:
>> mm/shmem.c:695:6: warning: no previous prototype for 'transparent_hugepage_allowed' [-Wmissing-prototypes]
     695 | bool transparent_hugepage_allowed(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/transparent_hugepage_allowed +695 mm/shmem.c

   694	
 > 695	bool transparent_hugepage_allowed(void)
   696	{
   697		return false;
   698	}
   699	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 39101 bytes --]

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 01/16] huge tmpfs: fix fallocate(vanilla) advance over huge pages
  2021-07-30  7:25   ` Hugh Dickins
@ 2021-07-30 21:36     ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-07-30 21:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Fri, Jul 30, 2021 at 12:25 AM Hugh Dickins <hughd@google.com> wrote:
>
> shmem_fallocate() goes to a lot of trouble to leave its newly allocated
> pages !Uptodate, partly to identify and undo them on failure, partly to
> leave the overhead of clearing them until later.  But the huge page case
> did not skip to the end of the extent, walked through the tail pages one
> by one, and appeared to work just fine: but in doing so, cleared and
> Uptodated the huge page, so there was no way to undo it on failure.
>
> Now advance immediately to the end of the huge extent, with a comment on
> why this is more than just an optimization.  But although this speeds up
> huge tmpfs fallocation, it does leave the clearing until first use, and
> some users may have come to appreciate slow fallocate but fast first use:
> if they complain, then we can consider adding a pass to clear at the end.
>
> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> Signed-off-by: Hugh Dickins <hughd@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

A nit below:

> ---
>  mm/shmem.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 70d9ce294bb4..0cd5c9156457 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2736,7 +2736,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>         inode->i_private = &shmem_falloc;
>         spin_unlock(&inode->i_lock);
>
> -       for (index = start; index < end; index++) {
> +       for (index = start; index < end; ) {
>                 struct page *page;
>
>                 /*
> @@ -2759,13 +2759,26 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>                         goto undone;
>                 }
>
> +               index++;
> +               /*
> +                * Here is a more important optimization than it appears:
> +                * a second SGP_FALLOC on the same huge page will clear it,
> +                * making it PageUptodate and un-undoable if we fail later.
> +                */
> +               if (PageTransCompound(page)) {
> +                       index = round_up(index, HPAGE_PMD_NR);
> +                       /* Beware 32-bit wraparound */
> +                       if (!index)
> +                               index--;
> +               }
> +
>                 /*
>                  * Inform shmem_writepage() how far we have reached.
>                  * No need for lock or barrier: we have the page lock.
>                  */
> -               shmem_falloc.next++;
>                 if (!PageUptodate(page))
> -                       shmem_falloc.nr_falloced++;
> +                       shmem_falloc.nr_falloced += index - shmem_falloc.next;
> +               shmem_falloc.next = index;

This also fixes the wrong accounting of nr_falloced, so it should
avoid returning -ENOMEM prematurely, IIUC. Is it worth mentioning
in the commit log?

>
>                 /*
>                  * If !PageUptodate, leave it that way so that freeable pages
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 03/16] huge tmpfs: remove shrinklist addition from shmem_setattr()
  2021-07-30  7:30   ` Hugh Dickins
@ 2021-07-30 21:50     ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-07-30 21:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Fri, Jul 30, 2021 at 12:31 AM Hugh Dickins <hughd@google.com> wrote:
>
> There's a block of code in shmem_setattr() to add the inode to
> shmem_unused_huge_shrink()'s shrinklist when lowering i_size: it dates
> from before 5.7 changed truncation to do split_huge_page() for itself,
> and should have been removed at that time.
>
> I am over-stating that: split_huge_page() can fail (notably if there's
> an extra reference to the page at that time), so there might be value in
> retrying.  But there were already retries as truncation worked through
> the tails, and this addition risks repeating unsuccessful retries
> indefinitely: I'd rather remove it now, and work on reducing the
> chance of split_huge_page() failures separately, if we need to.

Yes, agreed. Reviewed-by: Yang Shi <shy828301@gmail.com>

>
> Fixes: 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page() when punching hole")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/shmem.c | 19 -------------------
>  1 file changed, 19 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 24c9da6b41c2..ce3ccaac54d6 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1061,7 +1061,6 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
>  {
>         struct inode *inode = d_inode(dentry);
>         struct shmem_inode_info *info = SHMEM_I(inode);
> -       struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>         int error;
>
>         error = setattr_prepare(&init_user_ns, dentry, attr);
> @@ -1097,24 +1096,6 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
>                         if (oldsize > holebegin)
>                                 unmap_mapping_range(inode->i_mapping,
>                                                         holebegin, 0, 1);
> -
> -                       /*
> -                        * Part of the huge page can be beyond i_size: subject
> -                        * to shrink under memory pressure.
> -                        */
> -                       if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> -                               spin_lock(&sbinfo->shrinklist_lock);
> -                               /*
> -                                * _careful to defend against unlocked access to
> -                                * ->shrink_list in shmem_unused_huge_shrink()
> -                                */
> -                               if (list_empty_careful(&info->shrinklist)) {
> -                                       list_add_tail(&info->shrinklist,
> -                                                       &sbinfo->shrinklist);
> -                                       sbinfo->shrinklist_len++;
> -                               }
> -                               spin_unlock(&sbinfo->shrinklist_lock);
> -                       }
>                 }
>         }
>
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 04/16] huge tmpfs: revert shmem's use of transhuge_vma_enabled()
  2021-07-30  7:36   ` Hugh Dickins
@ 2021-07-30 21:56     ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-07-30 21:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Fri, Jul 30, 2021 at 12:36 AM Hugh Dickins <hughd@google.com> wrote:
>
> 5.14 commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP
> checking in transparent_hugepage_enabled()") added transhuge_vma_enabled()
> as a wrapper for two very different checks: shmem_huge_enabled() prefers
> to show those two checks explicitly, as before.

Basically I have no objection to separating them again. But IMHO the
two checks don't seem very different. Or does this just make things
easier for the following patches?
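
For reference, the two checks do come from two different userspace
knobs: VM_NOHUGEPAGE is per-VMA, set by madvise(MADV_NOHUGEPAGE) on a
range, while MMF_DISABLE_THP is per-process, set by
prctl(PR_SET_THP_DISABLE). A minimal standalone illustration of the
two knobs (not from this patch):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/prctl.h>

int main(void)
{
	size_t len = 4UL << 20;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* Per-VMA: sets VM_NOHUGEPAGE on just this mapping */
	madvise(p, len, MADV_NOHUGEPAGE);
	/* Per-process: sets MMF_DISABLE_THP on the whole mm */
	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
	return 0;
}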

>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/shmem.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ce3ccaac54d6..c6fa6f4f2db8 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -4003,7 +4003,8 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
>         loff_t i_size;
>         pgoff_t off;
>
> -       if (!transhuge_vma_enabled(vma, vma->vm_flags))
> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>                 return false;
>         if (shmem_huge == SHMEM_HUGE_FORCE)
>                 return true;
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 05/16] huge tmpfs: move shmem_huge_enabled() upwards
  2021-07-30  7:39   ` Hugh Dickins
@ 2021-07-30 21:57     ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-07-30 21:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Fri, Jul 30, 2021 at 12:39 AM Hugh Dickins <hughd@google.com> wrote:
>
> shmem_huge_enabled() is about to be enhanced into shmem_is_huge(),
> so that it can be used more widely throughout: before making functional
> changes, shift it to its final position (to avoid forward declaration).
>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  mm/shmem.c | 72 ++++++++++++++++++++++++++----------------------------
>  1 file changed, 35 insertions(+), 37 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index c6fa6f4f2db8..740d48ef1eb5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -476,6 +476,41 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>
>  static int shmem_huge __read_mostly;
>
> +bool shmem_huge_enabled(struct vm_area_struct *vma)
> +{
> +       struct inode *inode = file_inode(vma->vm_file);
> +       struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> +       loff_t i_size;
> +       pgoff_t off;
> +
> +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> +               return false;
> +       if (shmem_huge == SHMEM_HUGE_FORCE)
> +               return true;
> +       if (shmem_huge == SHMEM_HUGE_DENY)
> +               return false;
> +       switch (sbinfo->huge) {
> +       case SHMEM_HUGE_NEVER:
> +               return false;
> +       case SHMEM_HUGE_ALWAYS:
> +               return true;
> +       case SHMEM_HUGE_WITHIN_SIZE:
> +               off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
> +               i_size = round_up(i_size_read(inode), PAGE_SIZE);
> +               if (i_size >= HPAGE_PMD_SIZE &&
> +                               i_size >> PAGE_SHIFT >= off)
> +                       return true;
> +               fallthrough;
> +       case SHMEM_HUGE_ADVISE:
> +               /* TODO: implement fadvise() hints */
> +               return (vma->vm_flags & VM_HUGEPAGE);
> +       default:
> +               VM_BUG_ON(1);
> +               return false;
> +       }
> +}
> +
>  #if defined(CONFIG_SYSFS)
>  static int shmem_parse_huge(const char *str)
>  {
> @@ -3995,43 +4030,6 @@ struct kobj_attribute shmem_enabled_attr =
>         __ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -bool shmem_huge_enabled(struct vm_area_struct *vma)
> -{
> -       struct inode *inode = file_inode(vma->vm_file);
> -       struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> -       loff_t i_size;
> -       pgoff_t off;
> -
> -       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> -           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> -               return false;
> -       if (shmem_huge == SHMEM_HUGE_FORCE)
> -               return true;
> -       if (shmem_huge == SHMEM_HUGE_DENY)
> -               return false;
> -       switch (sbinfo->huge) {
> -               case SHMEM_HUGE_NEVER:
> -                       return false;
> -               case SHMEM_HUGE_ALWAYS:
> -                       return true;
> -               case SHMEM_HUGE_WITHIN_SIZE:
> -                       off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
> -                       i_size = round_up(i_size_read(inode), PAGE_SIZE);
> -                       if (i_size >= HPAGE_PMD_SIZE &&
> -                                       i_size >> PAGE_SHIFT >= off)
> -                               return true;
> -                       fallthrough;
> -               case SHMEM_HUGE_ADVISE:
> -                       /* TODO: implement fadvise() hints */
> -                       return (vma->vm_flags & VM_HUGEPAGE);
> -               default:
> -                       VM_BUG_ON(1);
> -                       return false;
> -       }
> -}
> -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> -
>  #else /* !CONFIG_SHMEM */
>
>  /*
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-07-30  7:42   ` Hugh Dickins
@ 2021-07-30 23:34     ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-07-30 23:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Fri, Jul 30, 2021 at 12:42 AM Hugh Dickins <hughd@google.com> wrote:
>
> Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> that a consistent set of checks can be applied, even when the inode is
> accessed through read/write syscalls (with NULL vma) instead of mmaps
> (the index argument is seldom of interest, but required by mount option
> "huge=within_size").  Clean up and rearrange the checks a little.
>
> This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
> still true that khugepaged's collapse_file() at that point wants a small
> page, the race that might allocate it a huge page is too unlikely to be
> worth optimizing against (we are there *because* there was at least one
> small page in the way), and handled by a later PageTransCompound check.

Yes, it seems too unlikely. But if it does happen, the
PageTransCompound check may not be good enough, since the page
allocated by shmem_getpage() may be charged to the wrong memcg (root
memcg). And it won't be replaced by a newly allocated huge page, so
the wrong charge can't be undone.

And another question: it seems the newly allocated huge page will
just be uncharged, instead of being freed, until
"khugepaged_pages_to_scan" pages have been scanned.
khugepaged_prealloc_page() is called to free the allocated huge page
before each call to khugepaged_scan_mm_slot(). But
khugepaged_scan_file() -> collapse_file() -> khugepaged_alloc_page()
may be called multiple times in the loop in khugepaged_scan_mm_slot(),
so khugepaged_alloc_page() may see that page and trigger the
VM_BUG_ON() IIUC.

The code is quite convoluted; I'm not sure whether I'm missing
something or not. And this problem seems very hard to trigger in a
real-life workload.

>
> Replace a couple of 0s by explicit SHMEM_HUGE_NEVERs; and replace the
> obscure !shmem_mapping() symlink check by explicit S_ISLNK() - nothing
> else needs that symlink check, so leave it there in shmem_getpage_gfp().
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  include/linux/shmem_fs.h |  9 +++--
>  mm/khugepaged.c          |  2 +-
>  mm/shmem.c               | 84 ++++++++++++----------------------------
>  3 files changed, 32 insertions(+), 63 deletions(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 9b7f7ac52351..3b05a28e34c4 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -86,7 +86,12 @@ extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
>  extern int shmem_unuse(unsigned int type, bool frontswap,
>                        unsigned long *fs_pages_to_unuse);
>
> -extern bool shmem_huge_enabled(struct vm_area_struct *vma);
> +extern bool shmem_is_huge(struct vm_area_struct *vma,
> +                         struct inode *inode, pgoff_t index);
> +static inline bool shmem_huge_enabled(struct vm_area_struct *vma)
> +{
> +       return shmem_is_huge(vma, file_inode(vma->vm_file), vma->vm_pgoff);
> +}
>  extern unsigned long shmem_swap_usage(struct vm_area_struct *vma);
>  extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
>                                                 pgoff_t start, pgoff_t end);
> @@ -95,8 +100,6 @@ extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
>  enum sgp_type {
>         SGP_READ,       /* don't exceed i_size, don't allocate page */
>         SGP_CACHE,      /* don't exceed i_size, may allocate page */
> -       SGP_NOHUGE,     /* like SGP_CACHE, but no huge pages */
> -       SGP_HUGE,       /* like SGP_CACHE, huge pages preferred */
>         SGP_WRITE,      /* may exceed i_size, may allocate !Uptodate page */
>         SGP_FALLOC,     /* like SGP_WRITE, but make existing page Uptodate */
>  };
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b0412be08fa2..cecb19c3e965 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1721,7 +1721,7 @@ static void collapse_file(struct mm_struct *mm,
>                                 xas_unlock_irq(&xas);
>                                 /* swap in or instantiate fallocated page */
>                                 if (shmem_getpage(mapping->host, index, &page,
> -                                                 SGP_NOHUGE)) {
> +                                                 SGP_CACHE)) {
>                                         result = SCAN_FAIL;
>                                         goto xa_unlocked;
>                                 }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 740d48ef1eb5..6def7391084c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -474,39 +474,35 @@ static bool shmem_confirm_swap(struct address_space *mapping,
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  /* ifdef here to avoid bloating shmem.o when not necessary */
>
> -static int shmem_huge __read_mostly;
> +static int shmem_huge __read_mostly = SHMEM_HUGE_NEVER;
>
> -bool shmem_huge_enabled(struct vm_area_struct *vma)
> +bool shmem_is_huge(struct vm_area_struct *vma,
> +                  struct inode *inode, pgoff_t index)
>  {
> -       struct inode *inode = file_inode(vma->vm_file);
> -       struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>         loff_t i_size;
> -       pgoff_t off;
>
> -       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> -           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> -               return false;
> -       if (shmem_huge == SHMEM_HUGE_FORCE)
> -               return true;
>         if (shmem_huge == SHMEM_HUGE_DENY)
>                 return false;
> -       switch (sbinfo->huge) {
> -       case SHMEM_HUGE_NEVER:
> +       if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
> +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
>                 return false;
> +       if (shmem_huge == SHMEM_HUGE_FORCE)
> +               return true;
> +
> +       switch (SHMEM_SB(inode->i_sb)->huge) {
>         case SHMEM_HUGE_ALWAYS:
>                 return true;
>         case SHMEM_HUGE_WITHIN_SIZE:
> -               off = round_up(vma->vm_pgoff, HPAGE_PMD_NR);
> +               index = round_up(index, HPAGE_PMD_NR);
>                 i_size = round_up(i_size_read(inode), PAGE_SIZE);
> -               if (i_size >= HPAGE_PMD_SIZE &&
> -                               i_size >> PAGE_SHIFT >= off)
> +               if (i_size >= HPAGE_PMD_SIZE && (i_size >> PAGE_SHIFT) >= index)
>                         return true;
>                 fallthrough;
>         case SHMEM_HUGE_ADVISE:
> -               /* TODO: implement fadvise() hints */
> -               return (vma->vm_flags & VM_HUGEPAGE);
> +               if (vma && (vma->vm_flags & VM_HUGEPAGE))
> +                       return true;
> +               fallthrough;
>         default:
> -               VM_BUG_ON(1);
>                 return false;
>         }
>  }
> @@ -680,6 +676,12 @@ static long shmem_unused_huge_count(struct super_block *sb,
>
>  #define shmem_huge SHMEM_HUGE_DENY
>
> +bool shmem_is_huge(struct vm_area_struct *vma,
> +                  struct inode *inode, pgoff_t index)
> +{
> +       return false;
> +}
> +
>  static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
>                 struct shrink_control *sc, unsigned long nr_to_split)
>  {
> @@ -1829,7 +1831,6 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>         struct shmem_sb_info *sbinfo;
>         struct mm_struct *charge_mm;
>         struct page *page;
> -       enum sgp_type sgp_huge = sgp;
>         pgoff_t hindex = index;
>         gfp_t huge_gfp;
>         int error;
> @@ -1838,8 +1839,6 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>
>         if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
>                 return -EFBIG;
> -       if (sgp == SGP_NOHUGE || sgp == SGP_HUGE)
> -               sgp = SGP_CACHE;
>  repeat:
>         if (sgp <= SGP_CACHE &&
>             ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
> @@ -1898,36 +1897,12 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>                 return 0;
>         }
>
> -       /* shmem_symlink() */
> -       if (!shmem_mapping(mapping))
> -               goto alloc_nohuge;
> -       if (shmem_huge == SHMEM_HUGE_DENY || sgp_huge == SGP_NOHUGE)
> +       /* Never use a huge page for shmem_symlink() */
> +       if (S_ISLNK(inode->i_mode))
>                 goto alloc_nohuge;
> -       if (shmem_huge == SHMEM_HUGE_FORCE)
> -               goto alloc_huge;
> -       switch (sbinfo->huge) {
> -       case SHMEM_HUGE_NEVER:
> +       if (!shmem_is_huge(vma, inode, index))
>                 goto alloc_nohuge;
> -       case SHMEM_HUGE_WITHIN_SIZE: {
> -               loff_t i_size;
> -               pgoff_t off;
> -
> -               off = round_up(index, HPAGE_PMD_NR);
> -               i_size = round_up(i_size_read(inode), PAGE_SIZE);
> -               if (i_size >= HPAGE_PMD_SIZE &&
> -                   i_size >> PAGE_SHIFT >= off)
> -                       goto alloc_huge;
>
> -               fallthrough;
> -       }
> -       case SHMEM_HUGE_ADVISE:
> -               if (sgp_huge == SGP_HUGE)
> -                       goto alloc_huge;
> -               /* TODO: implement fadvise() hints */
> -               goto alloc_nohuge;
> -       }
> -
> -alloc_huge:
>         huge_gfp = vma_thp_gfp_mask(vma);
>         huge_gfp = limit_gfp_mask(huge_gfp, gfp);
>         page = shmem_alloc_and_acct_page(huge_gfp, inode, index, true);
> @@ -2083,7 +2058,6 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>         struct vm_area_struct *vma = vmf->vma;
>         struct inode *inode = file_inode(vma->vm_file);
>         gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> -       enum sgp_type sgp;
>         int err;
>         vm_fault_t ret = VM_FAULT_LOCKED;
>
> @@ -2146,15 +2120,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>                 spin_unlock(&inode->i_lock);
>         }
>
> -       sgp = SGP_CACHE;
> -
> -       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> -           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> -               sgp = SGP_NOHUGE;
> -       else if (vma->vm_flags & VM_HUGEPAGE)
> -               sgp = SGP_HUGE;
> -
> -       err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp,
> +       err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
>                                   gfp, vma, vmf, &ret);
>         if (err)
>                 return vmf_error(err);
> @@ -3961,7 +3927,7 @@ int __init shmem_init(void)
>         if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY)
>                 SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
>         else
> -               shmem_huge = 0; /* just in case it was patched */
> +               shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
>  #endif
>         return 0;
>
> --
> 2.26.2
>
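
For readers, the huge=within_size case above reduces to a small
computation; here is a standalone sketch of just that predicate, with
assumed 4KB base pages (so HPAGE_PMD_NR = 512, HPAGE_PMD_SIZE = 2MB)
and the simplified power-of-two round_up():

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)	/* assumed 4KB */
#define HPAGE_PMD_NR	512UL
#define HPAGE_PMD_SIZE	(HPAGE_PMD_NR * PAGE_SIZE)
#define round_up(x, y)	((((x) - 1) | ((y) - 1)) + 1)

/* Mirrors the SHMEM_HUGE_WITHIN_SIZE check in shmem_is_huge() above */
static bool within_size_ok(unsigned long i_size, unsigned long index)
{
	index = round_up(index, HPAGE_PMD_NR);
	i_size = round_up(i_size, PAGE_SIZE);
	return i_size >= HPAGE_PMD_SIZE && (i_size >> PAGE_SHIFT) >= index;
}

int main(void)
{
	printf("%d\n", within_size_ok(4UL << 20, 0));	/* 4MB file: 1 */
	printf("%d\n", within_size_ok(1UL << 20, 0));	/* 1MB file: 0 */
	printf("%d\n", within_size_ok(4UL << 20, 600));	/* index -> 1024: 1 */
	return 0;
}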

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 09/16] huge tmpfs: decide stat.st_blksize by shmem_is_huge()
  2021-07-30  7:51   ` Hugh Dickins
@ 2021-07-30 23:40     ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-07-30 23:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Fri, Jul 30, 2021 at 12:51 AM Hugh Dickins <hughd@google.com> wrote:
>
> 4.18 commit 89fdcd262fd4 ("mm: shmem: make stat.st_blksize return huge
> page size if THP is on") added is_huge_enabled() to decide st_blksize:
> now that hugeness can be defined per file, that too needs to be replaced
> by shmem_is_huge().
>
> Unless they have been fcntl'ed F_HUGEPAGE, this does give a different
> answer (No) for small files on a "huge=within_size" mount: but that can
> be considered a minor bugfix.  And a different answer (No) for unfcntl'ed
> files on a "huge=advise" mount: I'm reluctant to complicate it, just to
> reproduce the same debatable answer as before.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>
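
For anyone wanting to observe this from userspace, a minimal check of
st_blksize (the path is an assumption; whether 2MB or 4KB is reported
depends on the mount's huge= option and the fcntls in this series):

#include <sys/stat.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	struct stat st;
	const char *path = argc > 1 ? argv[1] : "/dev/shm/testfile";

	if (stat(path, &st) != 0) {
		perror("stat");
		return 1;
	}
	/* Expect 2097152 when huge pages will be used, 4096 otherwise */
	printf("st_blksize = %ld\n", (long)st.st_blksize);
	return 0;
}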

> ---
>  mm/shmem.c | 12 +-----------
>  1 file changed, 1 insertion(+), 11 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 67a4b7a4849b..f50f2ede71da 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -712,15 +712,6 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -static inline bool is_huge_enabled(struct shmem_sb_info *sbinfo)
> -{
> -       if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> -           (shmem_huge == SHMEM_HUGE_FORCE || sbinfo->huge) &&
> -           shmem_huge != SHMEM_HUGE_DENY)
> -               return true;
> -       return false;
> -}
> -
>  /*
>   * Like add_to_page_cache_locked, but error if expected item has gone.
>   */
> @@ -1101,7 +1092,6 @@ static int shmem_getattr(struct user_namespace *mnt_userns,
>  {
>         struct inode *inode = path->dentry->d_inode;
>         struct shmem_inode_info *info = SHMEM_I(inode);
> -       struct shmem_sb_info *sb_info = SHMEM_SB(inode->i_sb);
>
>         if (info->alloced - info->swapped != inode->i_mapping->nrpages) {
>                 spin_lock_irq(&info->lock);
> @@ -1110,7 +1100,7 @@ static int shmem_getattr(struct user_namespace *mnt_userns,
>         }
>         generic_fillattr(&init_user_ns, inode, stat);
>
> -       if (is_huge_enabled(sb_info))
> +       if (shmem_is_huge(NULL, inode, 0))
>                 stat->blksize = HPAGE_PMD_SIZE;
>
>         return 0;
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 02/16] huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE
  2021-07-30  7:28   ` Hugh Dickins
@ 2021-07-30 23:48     ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-07-30 23:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Fri, Jul 30, 2021 at 12:28 AM Hugh Dickins <hughd@google.com> wrote:
>
> A successful shmem_fallocate() guarantees that the extent has been
> reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used.
> But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to
> split huge pages and free their excess beyond i_size; and by other uses
> of split_huge_page() near i_size.
>
> It's sad to add a shmem inode field just for this, but I did not find a
> better way to keep the guarantee.  A flag to say KEEP_SIZE has been used
> would be cheaper, but I'm averse to unclearable flags.  The fallocend
> field is not perfect either (many disjoint ranges might be fallocated),
> but good enough; and gains another use later on.
>
> Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
> Signed-off-by: Hugh Dickins <hughd@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>
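
For reference, a minimal userspace sketch of the guarantee this patch
protects: fallocate(FALLOC_FL_KEEP_SIZE) reserving an extent beyond
i_size on a tmpfs-backed memfd (memfd_create() assumes glibc 2.27+;
st_blocks showing the reservation is the point of the guarantee):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	struct stat st;
	int fd = memfd_create("keep_size_demo", 0);

	if (fd < 0)
		return 1;
	/* Reserve 8MB entirely beyond EOF, without changing i_size */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 8UL << 20) != 0) {
		perror("fallocate");
		return 1;
	}
	fstat(fd, &st);
	/* st_size stays 0, while st_blocks reflects the reserved extent */
	printf("size=%lld blocks=%lld\n",
	       (long long)st.st_size, (long long)st.st_blocks);
	return 0;
}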

> ---
>  include/linux/shmem_fs.h | 13 +++++++++++++
>  mm/huge_memory.c         |  6 ++++--
>  mm/shmem.c               | 15 ++++++++++++++-
>  3 files changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 8e775ce517bb..9b7f7ac52351 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -18,6 +18,7 @@ struct shmem_inode_info {
>         unsigned long           flags;
>         unsigned long           alloced;        /* data pages alloced to file */
>         unsigned long           swapped;        /* subtotal assigned to swap */
> +       pgoff_t                 fallocend;      /* highest fallocate endindex */
>         struct list_head        shrinklist;     /* shrinkable hpage inodes */
>         struct list_head        swaplist;       /* chain of maybes on swap */
>         struct shared_policy    policy;         /* NUMA memory alloc policy */
> @@ -119,6 +120,18 @@ static inline bool shmem_file(struct file *file)
>         return shmem_mapping(file->f_mapping);
>  }
>
> +/*
> + * If fallocate(FALLOC_FL_KEEP_SIZE) has been used, there may be pages
> + * beyond i_size's notion of EOF, which fallocate has committed to reserving:
> + * which split_huge_page() must therefore not delete.  This use of a single
> + * "fallocend" per inode errs on the side of not deleting a reservation when
> + * in doubt: there are plenty of cases when it preserves unreserved pages.
> + */
> +static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
> +{
> +       return max(eof, SHMEM_I(inode)->fallocend);
> +}
> +
>  extern bool shmem_charge(struct inode *inode, long pages);
>  extern void shmem_uncharge(struct inode *inode, long pages);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index afff3ac87067..890fb73ac89b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2454,11 +2454,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>
>         for (i = nr - 1; i >= 1; i--) {
>                 __split_huge_page_tail(head, i, lruvec, list);
> -               /* Some pages can be beyond i_size: drop them from page cache */
> +               /* Some pages can be beyond EOF: drop them from page cache */
>                 if (head[i].index >= end) {
>                         ClearPageDirty(head + i);
>                         __delete_from_page_cache(head + i, NULL);
> -                       if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
> +                       if (shmem_mapping(head->mapping))
>                                 shmem_uncharge(head->mapping->host, 1);
>                         put_page(head + i);
>                 } else if (!PageAnon(page)) {
> @@ -2686,6 +2686,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>                  * head page lock is good enough to serialize the trimming.
>                  */
>                 end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
> +               if (shmem_mapping(mapping))
> +                       end = shmem_fallocend(mapping->host, end);
>         }
>
>         /*
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0cd5c9156457..24c9da6b41c2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -905,6 +905,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>         if (lend == -1)
>                 end = -1;       /* unsigned, so actually very big */
>
> +       if (info->fallocend > start && info->fallocend <= end && !unfalloc)
> +               info->fallocend = start;
> +
>         pagevec_init(&pvec);
>         index = start;
>         while (index < end && find_lock_entries(mapping, index, end - 1,
> @@ -2667,7 +2670,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>         struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>         struct shmem_inode_info *info = SHMEM_I(inode);
>         struct shmem_falloc shmem_falloc;
> -       pgoff_t start, index, end;
> +       pgoff_t start, index, end, undo_fallocend;
>         int error;
>
>         if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> @@ -2736,6 +2739,15 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>         inode->i_private = &shmem_falloc;
>         spin_unlock(&inode->i_lock);
>
> +       /*
> +        * info->fallocend is only relevant when huge pages might be
> +        * involved: to prevent split_huge_page() freeing fallocated
> +        * pages when FALLOC_FL_KEEP_SIZE committed beyond i_size.
> +        */
> +       undo_fallocend = info->fallocend;
> +       if (info->fallocend < end)
> +               info->fallocend = end;
> +
>         for (index = start; index < end; ) {
>                 struct page *page;
>
> @@ -2750,6 +2762,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>                 else
>                         error = shmem_getpage(inode, index, &page, SGP_FALLOC);
>                 if (error) {
> +                       info->fallocend = undo_fallocend;
>                         /* Remove the !PageUptodate pages we added */
>                         if (index > start) {
>                                 shmem_undo_range(inode,
> --
> 2.26.2
>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 02/16] huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE
@ 2021-07-30 23:48     ` Yang Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-07-30 23:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Fri, Jul 30, 2021 at 12:28 AM Hugh Dickins <hughd@google.com> wrote:
>
> A successful shmem_fallocate() guarantees that the extent has been
> reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used.
> But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to
> split huge pages and free their excess beyond i_size; and by other uses
> of split_huge_page() near i_size.
>
> It's sad to add a shmem inode field just for this, but I did not find a
> better way to keep the guarantee.  A flag to say KEEP_SIZE has been used
> would be cheaper, but I'm averse to unclearable flags.  The fallocend
> field is not perfect either (many disjoint ranges might be fallocated),
> but good enough; and gains another use later on.
>
> Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
> Signed-off-by: Hugh Dickins <hughd@google.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  include/linux/shmem_fs.h | 13 +++++++++++++
>  mm/huge_memory.c         |  6 ++++--
>  mm/shmem.c               | 15 ++++++++++++++-
>  3 files changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 8e775ce517bb..9b7f7ac52351 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -18,6 +18,7 @@ struct shmem_inode_info {
>         unsigned long           flags;
>         unsigned long           alloced;        /* data pages alloced to file */
>         unsigned long           swapped;        /* subtotal assigned to swap */
> +       pgoff_t                 fallocend;      /* highest fallocate endindex */
>         struct list_head        shrinklist;     /* shrinkable hpage inodes */
>         struct list_head        swaplist;       /* chain of maybes on swap */
>         struct shared_policy    policy;         /* NUMA memory alloc policy */
> @@ -119,6 +120,18 @@ static inline bool shmem_file(struct file *file)
>         return shmem_mapping(file->f_mapping);
>  }
>
> +/*
> + * If fallocate(FALLOC_FL_KEEP_SIZE) has been used, there may be pages
> + * beyond i_size's notion of EOF, which fallocate has committed to reserving:
> + * which split_huge_page() must therefore not delete.  This use of a single
> + * "fallocend" per inode errs on the side of not deleting a reservation when
> + * in doubt: there are plenty of cases when it preserves unreserved pages.
> + */
> +static inline pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
> +{
> +       return max(eof, SHMEM_I(inode)->fallocend);
> +}
> +
>  extern bool shmem_charge(struct inode *inode, long pages);
>  extern void shmem_uncharge(struct inode *inode, long pages);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index afff3ac87067..890fb73ac89b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2454,11 +2454,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>
>         for (i = nr - 1; i >= 1; i--) {
>                 __split_huge_page_tail(head, i, lruvec, list);
> -               /* Some pages can be beyond i_size: drop them from page cache */
> +               /* Some pages can be beyond EOF: drop them from page cache */
>                 if (head[i].index >= end) {
>                         ClearPageDirty(head + i);
>                         __delete_from_page_cache(head + i, NULL);
> -                       if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
> +                       if (shmem_mapping(head->mapping))
>                                 shmem_uncharge(head->mapping->host, 1);
>                         put_page(head + i);
>                 } else if (!PageAnon(page)) {
> @@ -2686,6 +2686,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>                  * head page lock is good enough to serialize the trimming.
>                  */
>                 end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
> +               if (shmem_mapping(mapping))
> +                       end = shmem_fallocend(mapping->host, end);
>         }
>
>         /*
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0cd5c9156457..24c9da6b41c2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -905,6 +905,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>         if (lend == -1)
>                 end = -1;       /* unsigned, so actually very big */
>
> +       if (info->fallocend > start && info->fallocend <= end && !unfalloc)
> +               info->fallocend = start;
> +
>         pagevec_init(&pvec);
>         index = start;
>         while (index < end && find_lock_entries(mapping, index, end - 1,
> @@ -2667,7 +2670,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>         struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>         struct shmem_inode_info *info = SHMEM_I(inode);
>         struct shmem_falloc shmem_falloc;
> -       pgoff_t start, index, end;
> +       pgoff_t start, index, end, undo_fallocend;
>         int error;
>
>         if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> @@ -2736,6 +2739,15 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>         inode->i_private = &shmem_falloc;
>         spin_unlock(&inode->i_lock);
>
> +       /*
> +        * info->fallocend is only relevant when huge pages might be
> +        * involved: to prevent split_huge_page() freeing fallocated
> +        * pages when FALLOC_FL_KEEP_SIZE committed beyond i_size.
> +        */
> +       undo_fallocend = info->fallocend;
> +       if (info->fallocend < end)
> +               info->fallocend = end;
> +
>         for (index = start; index < end; ) {
>                 struct page *page;
>
> @@ -2750,6 +2762,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>                 else
>                         error = shmem_getpage(inode, index, &page, SGP_FALLOC);
>                 if (error) {
> +                       info->fallocend = undo_fallocend;
>                         /* Remove the !PageUptodate pages we added */
>                         if (index > start) {
>                                 shmem_undo_range(inode,
> --
> 2.26.2
>


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 01/16] huge tmpfs: fix fallocate(vanilla) advance over huge pages
  2021-07-30 21:36     ` Yang Shi
@ 2021-08-01  3:38       ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-08-01  3:38 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Shakeel Butt, Kirill A. Shutemov,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld,
	Linux FS-devel Mailing List, Linux Kernel Mailing List,
	linux-api, Linux MM

On Fri, 30 Jul 2021, Yang Shi wrote:
> On Fri, Jul 30, 2021 at 12:25 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > shmem_fallocate() goes to a lot of trouble to leave its newly allocated
> > pages !Uptodate, partly to identify and undo them on failure, partly to
> > leave the overhead of clearing them until later.  But the huge page case
> > did not skip to the end of the extent, walked through the tail pages one
> > by one, and appeared to work just fine: but in doing so, cleared and
> > Uptodated the huge page, so there was no way to undo it on failure.
> >
> > Now advance immediately to the end of the huge extent, with a comment on
> > why this is more than just an optimization.  But although this speeds up
> > huge tmpfs fallocation, it does leave the clearing until first use, and
> > some users may have come to appreciate slow fallocate but fast first use:
> > if they complain, then we can consider adding a pass to clear at the end.
> >
> > Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> 
> Reviewed-by: Yang Shi <shy828301@gmail.com>

Many thanks for reviewing so many of these.

> 
> A nit below:
> 
> > ---
> >  mm/shmem.c | 19 ++++++++++++++++---
> >  1 file changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 70d9ce294bb4..0cd5c9156457 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2736,7 +2736,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> >         inode->i_private = &shmem_falloc;
> >         spin_unlock(&inode->i_lock);
> >
> > -       for (index = start; index < end; index++) {
> > +       for (index = start; index < end; ) {
> >                 struct page *page;
> >
> >                 /*
> > @@ -2759,13 +2759,26 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> >                         goto undone;
> >                 }
> >
> > +               index++;
> > +               /*
> > +                * Here is a more important optimization than it appears:
> > +                * a second SGP_FALLOC on the same huge page will clear it,
> > +                * making it PageUptodate and un-undoable if we fail later.
> > +                */
> > +               if (PageTransCompound(page)) {
> > +                       index = round_up(index, HPAGE_PMD_NR);
> > +                       /* Beware 32-bit wraparound */
> > +                       if (!index)
> > +                               index--;
> > +               }
> > +
> >                 /*
> >                  * Inform shmem_writepage() how far we have reached.
> >                  * No need for lock or barrier: we have the page lock.
> >                  */
> > -               shmem_falloc.next++;
> >                 if (!PageUptodate(page))
> > -                       shmem_falloc.nr_falloced++;
> > +                       shmem_falloc.nr_falloced += index - shmem_falloc.next;
> > +               shmem_falloc.next = index;
> 
> This also fixed the wrong accounting of nr_falloced, so it should be
> able to avoid returning -ENOMEM prematurely IIUC. Is it worth
> mentioning in the commit log?

It took me a long time to see your point there: ah yes, because it made
the whole huge page Uptodate when it reached the first tail, there would
have been only one nr_falloced++ for the whole of the huge page: well
spotted, thanks, I hadn't realized that.

Though I'm not so sure about your premature -ENOMEM: because once it has
made the huge page Uptodate, the other end (shmem_writepage()) will not
be incrementing nr_unswapped at all: so -ENOMEM would have been deferred
rather than premature, wouldn't it?
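
(For reference, the check at the other end is roughly the following -
paraphrased from shmem_writepage(), with details trimmed:

	if (!PageUptodate(page)) {
		struct shmem_falloc *shmem_falloc;

		spin_lock(&inode->i_lock);
		shmem_falloc = inode->i_private;
		if (shmem_falloc && !shmem_falloc->waitq &&
		    index >= shmem_falloc->start &&
		    index < shmem_falloc->next)
			shmem_falloc->nr_unswapped++;	/* what -ENOMEM is judged by */
		else
			shmem_falloc = NULL;
		spin_unlock(&inode->i_lock);
		if (shmem_falloc)
			goto redirty;
		...
	}

so once the whole huge page has been made Uptodate, that nr_unswapped
bump never happens for it.)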

Add a comment on this in the commit log: yes, I guess so, but I haven't
worked out what to write yet.

Hugh

> 
> >
> >                 /*
> >                  * If !PageUptodate, leave it that way so that freeable pages
> > --
> > 2.26.2

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 04/16] huge tmpfs: revert shmem's use of transhuge_vma_enabled()
  2021-07-30 21:56     ` Yang Shi
@ 2021-08-01  4:01       ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-08-01  4:01 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Shakeel Butt, Kirill A. Shutemov,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld,
	Linux FS-devel Mailing List, Linux Kernel Mailing List,
	linux-api, Linux MM

On Fri, 30 Jul 2021, Yang Shi wrote:
> On Fri, Jul 30, 2021 at 12:36 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > 5.14 commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP
> > checking in transparent_hugepage_enabled()") added transhuge_vma_enabled()
> > as a wrapper for two very different checks: shmem_huge_enabled() prefers
> > to show those two checks explicitly, as before.
> 
> Basically I have no objection to separating them again. But IMHO they
> seem not very different. Or just makes things easier for the following
> patches?

Well, it made it easier to apply the patch I'd prepared earlier,
but that was not the point; and I thought it best to be upfront
about the reversion, rather than hiding it in the movement.

The end result of the two checks is the same (don't try for huge pages),
and they have been grouped together because they occurred together in
several places, and both rely on "vma".

But one check is whether the app has marked that address range not to use
THPs; and the other check is whether the process is running in a hierarchy
that has been marked never to use THPs (which just uses vma to get to mm
to get to mm->flags (whether current->mm would be more relevant is not an
argument I want to get into, I'm not at all sure)).

To me those are very different; and I'm particularly concerned to make
MMF_DISABLE_THP references visible, since it did not exist when Kirill
and I first implemented shmem huge pages, and I've tended to forget it:
but consider it more in this series.
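
(MMF_DISABLE_THP being the mm flag which prctl(PR_SET_THP_DISABLE) sets,
so from userspace, roughly:

	#include <sys/prctl.h>

	/* mark this whole mm - and mms forked from it - as never using THP */
	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);

whereas VM_NOHUGEPAGE comes from madvise(addr, len, MADV_NOHUGEPAGE) on
just the one range.)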

Hugh

> 
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >  mm/shmem.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index ce3ccaac54d6..c6fa6f4f2db8 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -4003,7 +4003,8 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
> >         loff_t i_size;
> >         pgoff_t off;
> >
> > -       if (!transhuge_vma_enabled(vma, vma->vm_flags))
> > +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> > +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> >                 return false;
> >         if (shmem_huge == SHMEM_HUGE_FORCE)
> >                 return true;
> > --
> > 2.26.2

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-07-30 23:34     ` Yang Shi
@ 2021-08-01  5:22       ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-08-01  5:22 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Shakeel Butt, Kirill A. Shutemov,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld,
	Linux FS-devel Mailing List, Linux Kernel Mailing List,
	linux-api, Linux MM

On Fri, 30 Jul 2021, Yang Shi wrote:
> On Fri, Jul 30, 2021 at 12:42 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> > that a consistent set of checks can be applied, even when the inode is
> > accessed through read/write syscalls (with NULL vma) instead of mmaps
> > (the index argument is seldom of interest, but required by mount option
> > "huge=within_size").  Clean up and rearrange the checks a little.
> >
> > This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> > were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
> > still true that khugepaged's collapse_file() at that point wants a small
> > page, the race that might allocate it a huge page is too unlikely to be
> > worth optimizing against (we are there *because* there was at least one
> > small page in the way), and handled by a later PageTransCompound check.
> 
> Yes, it seems too unlikely. But if it happens the PageTransCompound
> check may be not good enough since the page allocated by
> shmem_getpage() may be charged to wrong memcg (root memcg). And it
> won't be replaced by a newly allocated huge page so the wrong charge
> can't be undone.

Good point on the memcg charge: I hadn't thought of that.  Of course
it's not specific to SGP_CACHE versus SGP_NOHUGE (this patch), but I
admit that a huge mischarge is hugely worse than a small mischarge.

We could fix it by making shmem_getpage_gfp() non-static, and pointing
to the vma (hence its mm, hence its memcg) here, couldn't we?  Easily
done, but I don't really want to make shmem_getpage_gfp() public just
for this, for two reasons.
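
(Just to illustrate "easily done" - a sketch only, and one which assumes
collapse_file() had a suitable vma to hand, which today it does not:

	/* hypothetical: route the memcg charge via vma->vm_mm */
	if (shmem_getpage_gfp(mapping->host, index, &page, SGP_CACHE,
			      mapping_gfp_mask(mapping), vma, NULL, NULL)) {
		result = SCAN_FAIL;
		goto xa_unlocked;
	}

but I'm not proposing that here.)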

One is that the huge race is just so unlikely; and a mischarge to root
is not the end of the world, so long as it's not reproducible.  It can
only happen on the very first page of the huge extent, and the prior
"Stop if extent has been truncated" check makes sure there was one
entry in the extent at that point: so the race with hole-punch can only
occur after we xas_unlock_irq(&xas) immediately before shmem_getpage()
looks up the page in the tree (and I say hole-punch not truncate,
because shmem_getpage()'s i_size check will reject when truncated).
I don't doubt that it could happen, but stand by not optimizing against.

Other reason is that doing shmem_getpage() (or shmem_getpage_gfp())
there is unhealthy for unrelated reasons, that I cannot afford to get
into sending patches for at this time: but some of our users found the
worst-case latencies in collapse_file() intolerable - shmem_getpage()
may be reading in from swap, while the locked head of the huge page
being built is in the page cache keeping other users waiting.  So,
I'd say there's something worse than memcg in that shmem_getpage(),
but fixing that cannot be a part of this series.

> 
> And, another question is it seems the newly allocated huge page will
> just be uncharged instead of being freed until
> "khugepaged_pages_to_scan" pages are scanned. The
> khugepaged_prealloc_page() is called to free the allocated huge page
> before each call to khugepaged_scan_mm_slot(). But
> khugepaged_scan_file() -> collapse_file() -> khugepaged_alloc_page()
> may be called multiple times in the loop in khugepaged_scan_mm_slot(),
> so khugepaged_alloc_page() may see that page to trigger VM_BUG IIUC.
> 
> The code is quite convoluted, I'm not sure whether I miss something or
> not. And this problem seems very hard to trigger in real life
> workload.

Just to clarify, those two paragraphs are not about this patch, but about
what happens to mm/khugepaged.c's newly allocated huge page, when collapse
fails for any reason.

Yes, the code is convoluted: that's because it takes very different paths
when CONFIG_NUMA=y (when it cannot predict which node to allocate from)
and when not NUMA (when it can allocate the huge page at a good unlocked
moment, and carry it forward from one attempt to the next).

I don't like it at all, the two paths are confusing: sometimes I wonder
whether we should just remove the !CONFIG_NUMA path entirely; and other
times I wonder in the other direction, whether the CONFIG_NUMA=y path
ought to go the other way when it finds nr_node_ids is 1.  Undecided.

I'm confident that if you work through the two cases (thinking about
only one of them at once!), you'll find that the failure paths (not
to mention the successful paths) do actually work correctly without
leaking (well, maybe the !NUMA path can hold on to one huge page
indefinitely, I forget, but I wouldn't count that as leaking).

Collapse failure is not uncommon and leaking huge pages gets noticed.

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-01  5:22       ` Hugh Dickins
@ 2021-08-01  5:37         ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-08-01  5:37 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Shakeel Butt, Kirill A. Shutemov,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld,
	Linux FS-devel Mailing List, Linux Kernel Mailing List,
	linux-api, Linux MM

On Sat, 31 Jul 2021, Hugh Dickins wrote:
> On Fri, 30 Jul 2021, Yang Shi wrote:
> > On Fri, Jul 30, 2021 at 12:42 AM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> > > that a consistent set of checks can be applied, even when the inode is
> > > accessed through read/write syscalls (with NULL vma) instead of mmaps
> > > (the index argument is seldom of interest, but required by mount option
> > > "huge=within_size").  Clean up and rearrange the checks a little.
> > >
> > > This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> > > were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
> > > still true that khugepaged's collapse_file() at that point wants a small
> > > page, the race that might allocate it a huge page is too unlikely to be
> > > worth optimizing against (we are there *because* there was at least one
> > > small page in the way), and handled by a later PageTransCompound check.
> > 
> > Yes, it seems too unlikely. But if it happens the PageTransCompound
> > check may be not good enough since the page allocated by
> > shmem_getpage() may be charged to wrong memcg (root memcg). And it
> > won't be replaced by a newly allocated huge page so the wrong charge
> > can't be undone.
> 
> Good point on the memcg charge: I hadn't thought of that.  Of course
> it's not specific to SGP_CACHE versus SGP_NOHUGE (this patch), but I
> admit that a huge mischarge is hugely worse than a small mischarge.

Stupid me (and maybe I haven't given this enough consideration yet):
but, much better than SGP_NOHUGE, much better than SGP_CACHE, would be
SGP_READ there, wouldn't it?  Needs to beware of the NULL too, of course.
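
(Roughly, something like this - where the error path is only my guess at
how that NULL would then be handled:

	/* hypothetical: SGP_READ neither allocates nor clears, so may return NULL */
	if (shmem_getpage(mapping->host, index, &page, SGP_READ) || !page) {
		result = SCAN_FAIL;
		goto xa_unlocked;
	}

Needs more thought before I'd commit to it.)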

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 01/16] huge tmpfs: fix fallocate(vanilla) advance over huge pages
  2021-08-01  3:38       ` Hugh Dickins
@ 2021-08-02 20:36         ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-08-02 20:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Sat, Jul 31, 2021 at 8:38 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Fri, 30 Jul 2021, Yang Shi wrote:
> > On Fri, Jul 30, 2021 at 12:25 AM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > shmem_fallocate() goes to a lot of trouble to leave its newly allocated
> > > pages !Uptodate, partly to identify and undo them on failure, partly to
> > > leave the overhead of clearing them until later.  But the huge page case
> > > did not skip to the end of the extent, walked through the tail pages one
> > > by one, and appeared to work just fine: but in doing so, cleared and
> > > Uptodated the huge page, so there was no way to undo it on failure.
> > >
> > > Now advance immediately to the end of the huge extent, with a comment on
> > > why this is more than just an optimization.  But although this speeds up
> > > huge tmpfs fallocation, it does leave the clearing until first use, and
> > > some users may have come to appreciate slow fallocate but fast first use:
> > > if they complain, then we can consider adding a pass to clear at the end.
> > >
> > > Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> > > Signed-off-by: Hugh Dickins <hughd@google.com>
> >
> > Reviewed-by: Yang Shi <shy828301@gmail.com>
>
> Many thanks for reviewing so many of these.
>
> >
> > A nit below:
> >
> > > ---
> > >  mm/shmem.c | 19 ++++++++++++++++---
> > >  1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 70d9ce294bb4..0cd5c9156457 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -2736,7 +2736,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> > >         inode->i_private = &shmem_falloc;
> > >         spin_unlock(&inode->i_lock);
> > >
> > > -       for (index = start; index < end; index++) {
> > > +       for (index = start; index < end; ) {
> > >                 struct page *page;
> > >
> > >                 /*
> > > @@ -2759,13 +2759,26 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> > >                         goto undone;
> > >                 }
> > >
> > > +               index++;
> > > +               /*
> > > +                * Here is a more important optimization than it appears:
> > > +                * a second SGP_FALLOC on the same huge page will clear it,
> > > +                * making it PageUptodate and un-undoable if we fail later.
> > > +                */
> > > +               if (PageTransCompound(page)) {
> > > +                       index = round_up(index, HPAGE_PMD_NR);
> > > +                       /* Beware 32-bit wraparound */
> > > +                       if (!index)
> > > +                               index--;
> > > +               }
> > > +
> > >                 /*
> > >                  * Inform shmem_writepage() how far we have reached.
> > >                  * No need for lock or barrier: we have the page lock.
> > >                  */
> > > -               shmem_falloc.next++;
> > >                 if (!PageUptodate(page))
> > > -                       shmem_falloc.nr_falloced++;
> > > +                       shmem_falloc.nr_falloced += index - shmem_falloc.next;
> > > +               shmem_falloc.next = index;
> >
> > This also fixed the wrong accounting of nr_falloced, so it should be
> > able to avoid returning -ENOMEM prematurely IIUC. Is it worth
> > mentioning in the commit log?
>
> It took me a long time to see your point there: ah yes, because it made
> the whole huge page Uptodate when it reached the first tail, there would
> have been only one nr_falloced++ for the whole of the huge page: well
> spotted, thanks, I hadn't realized that.
>
> Though I'm not so sure about your premature -ENOMEM: because once it has
> made the huge page Uptodate, the other end (shmem_writepage()) will not
> be incrementing nr_unswapped at all: so -ENOMEM would have been deferred
> rather than premature, wouldn't it?

Ah, OK, I didn't pay too much attention to how nr_unswapped is
incremented. I just thought nr_falloced would be incremented by 512
rather than 1, so it is less likely to return -ENOMEM.

>
> Add a comment on this in the commit log: yes, I guess so, but I haven't
> worked out what to write yet.
>
> Hugh
>
> >
> > >
> > >                 /*
> > >                  * If !PageUptodate, leave it that way so that freeable pages
> > > --
> > > 2.26.2

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 04/16] huge tmpfs: revert shmem's use of transhuge_vma_enabled()
  2021-08-01  4:01       ` Hugh Dickins
@ 2021-08-02 20:39         ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-08-02 20:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Sat, Jul 31, 2021 at 9:01 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Fri, 30 Jul 2021, Yang Shi wrote:
> > On Fri, Jul 30, 2021 at 12:36 AM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > 5.14 commit e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP
> > > checking in transparent_hugepage_enabled()") added transhuge_vma_enabled()
> > > as a wrapper for two very different checks: shmem_huge_enabled() prefers
> > > to show those two checks explicitly, as before.
> >
> > Basically I have no objection to separating them again. But IMHO they
> > seem not very different. Or just makes things easier for the following
> > patches?
>
> Well, it made it easier to apply the patch I'd prepared earlier,
> but that was not the point; and I thought it best to be upfront
> about the reversion, rather than hiding it in the movement.
>
> The end result of the two checks is the same (don't try for huge pages),
> and they have been grouped together because they occurred together in
> several places, and both rely on "vma".
>
> But one check is whether the app has marked that address range not to use
> THPs; and the other check is whether the process is running in a hierarchy
> that has been marked never to use THPs (which just uses vma to get to mm
> to get to mm->flags (whether current->mm would be more relevant is not an
> argument I want to get into, I'm not at all sure)).
>
> To me those are very different; and I'm particularly concerned to make
> MMF_DISABLE_THP references visible, since it did not exist when Kirill
> and I first implemented shmem huge pages, and I've tended to forget it:
> but consider it more in this series.

Yes, I agree one checks the vma and the other checks the mm, so they are
different from that perspective. Anyway, as I said I have no objection
to this change. You could add Reviewed-by: Yang Shi
<shy828301@gmail.com>

>
> Hugh
>
> >
> > >
> > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > ---
> > >  mm/shmem.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index ce3ccaac54d6..c6fa6f4f2db8 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -4003,7 +4003,8 @@ bool shmem_huge_enabled(struct vm_area_struct *vma)
> > >         loff_t i_size;
> > >         pgoff_t off;
> > >
> > > -       if (!transhuge_vma_enabled(vma, vma->vm_flags))
> > > +       if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> > > +           test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> > >                 return false;
> > >         if (shmem_huge == SHMEM_HUGE_FORCE)
> > >                 return true;
> > > --
> > > 2.26.2

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-01  5:22       ` Hugh Dickins
@ 2021-08-02 21:14         ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-08-02 21:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Sat, Jul 31, 2021 at 10:22 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Fri, 30 Jul 2021, Yang Shi wrote:
> > On Fri, Jul 30, 2021 at 12:42 AM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> > > that a consistent set of checks can be applied, even when the inode is
> > > accessed through read/write syscalls (with NULL vma) instead of mmaps
> > > (the index argument is seldom of interest, but required by mount option
> > > "huge=within_size").  Clean up and rearrange the checks a little.
> > >
> > > This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> > > were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
> > > still true that khugepaged's collapse_file() at that point wants a small
> > > page, the race that might allocate it a huge page is too unlikely to be
> > > worth optimizing against (we are there *because* there was at least one
> > > small page in the way), and handled by a later PageTransCompound check.
> >
> > Yes, it seems too unlikely. But if it happens the PageTransCompound
> > check may be not good enough since the page allocated by
> > shmem_getpage() may be charged to wrong memcg (root memcg). And it
> > won't be replaced by a newly allocated huge page so the wrong charge
> > can't be undone.
>
> Good point on the memcg charge: I hadn't thought of that.  Of course
> it's not specific to SGP_CACHE versus SGP_NOHUGE (this patch), but I
> admit that a huge mischarge is hugely worse than a small mischarge.

The small page could be collapsed into a huge page sooner or later, so
the mischarge may be transient. But a huge page can't be replaced.

>
> We could fix it by making shmem_getpage_gfp() non-static, and pointing
> to the vma (hence its mm, hence its memcg) here, couldn't we?  Easily
> done, but I don't really want to make shmem_getpage_gfp() public just
> for this, for two reasons.
>
> One is that the huge race it just so unlikely; and a mischarge to root
> is not the end of the world, so long as it's not reproducible.  It can
> only happen on the very first page of the huge extent, and the prior

OK, if so the mischarge is not as bad as what I thought in the first place.

> "Stop if extent has been truncated" check makes sure there was one
> entry in the extent at that point: so the race with hole-punch can only
> occur after we xas_unlock_irq(&xas) immediately before shmem_getpage()
> looks up the page in the tree (and I say hole-punch not truncate,
> because shmem_getpage()'s i_size check will reject when truncated).
> I don't doubt that it could happen, but stand by not optimizing against.

I agree the race is so unlikely that it may not be worth optimizing
against right now, but a note or a comment may be worthwhile.

>
> Other reason is that doing shmem_getpage() (or shmem_getpage_gfp())
> there is unhealthy for unrelated reasons, that I cannot afford to get
> into sending patches for at this time: but some of our users found the
> worst-case latencies in collapse_file() intolerable - shmem_getpage()
> may be reading in from swap, while the locked head of the huge page
> being built is in the page cache keeping other users waiting.  So,
> I'd say there's something worse than memcg in that shmem_getpage(),
> but fixing that cannot be a part of this series.

Yeah, that is a different problem.

>
> >
> > And, another question is it seems the newly allocated huge page will
> > just be uncharged instead of being freed until
> > "khugepaged_pages_to_scan" pages are scanned. The
> > khugepaged_prealloc_page() is called to free the allocated huge page
> > before each call to khugepaged_scan_mm_slot(). But
> > khugepaged_scan_file() -> collapse_file() -> khugepaged_alloc_page()
> > may be called multiple times in the loop in khugepaged_scan_mm_slot(),
> > so khugepaged_alloc_page() may see that page to trigger VM_BUG IIUC.
> >
> > The code is quite convoluted, I'm not sure whether I miss something or
> > not. And this problem seems very hard to trigger in real life
> > workload.
>
> Just to clarify, those two paragraphs are not about this patch, but about
> what happens to mm/khugepaged.c's newly allocated huge page, when collapse
> fails for any reason.
>
> Yes, the code is convoluted: that's because it takes very different paths
> when CONFIG_NUMA=y (when it cannot predict which node to allocate from)
> and when not NUMA (when it can allocate the huge page at a good unlocked
> moment, and carry it forward from one attempt to the next).
>
> I don't like it at all, the two paths are confusing: sometimes I wonder
> whether we should just remove the !CONFIG_NUMA path entirely; and other
> times I wonder in the other direction, whether the CONFIG_NUMA=y path
> ought to go the other way when it finds nr_node_ids is 1.  Undecided.

I suppose it is just a performance consideration to keep the allocated
huge page, but I'm not sure how much difference it would make if we
removed it (removed the !CONFIG_NUMA path), because the pcp lists can
cache THPs now since Mel's patch 44042b449872 ("mm/page_alloc: allow
high-order pages to be stored on the per-cpu lists").

It seems to provide a similar optimization, but in the buddy allocator
layer, so khugepaged doesn't have to maintain its own implementation.

>
> I'm confident that if you work through the two cases (thinking about
> only one of them at once!), you'll find that the failure paths (not
> to mention the successful paths) do actually work correctly without
> leaking (well, maybe the !NUMA path can hold on to one huge page
> indefinitely, I forget, but I wouldn't count that as leaking).

IIUC the !NUMA path could hold on to one huge page indefinitely. But
I've never seen the BUG personally, so maybe you are right.

>
> Collapse failure is not uncommon and leaking huge pages gets noticed.
>
> Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
@ 2021-08-02 21:14         ` Yang Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-08-02 21:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Sat, Jul 31, 2021 at 10:22 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Fri, 30 Jul 2021, Yang Shi wrote:
> > On Fri, Jul 30, 2021 at 12:42 AM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> > > that a consistent set of checks can be applied, even when the inode is
> > > accessed through read/write syscalls (with NULL vma) instead of mmaps
> > > (the index argument is seldom of interest, but required by mount option
> > > "huge=within_size").  Clean up and rearrange the checks a little.
> > >
> > > This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> > > were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
> > > still true that khugepaged's collapse_file() at that point wants a small
> > > page, the race that might allocate it a huge page is too unlikely to be
> > > worth optimizing against (we are there *because* there was at least one
> > > small page in the way), and handled by a later PageTransCompound check.
> >
> > Yes, it seems too unlikely. But if it happens the PageTransCompound
> > check may be not good enough since the page allocated by
> > shmem_getpage() may be charged to wrong memcg (root memcg). And it
> > won't be replaced by a newly allocated huge page so the wrong charge
> > can't be undone.
>
> Good point on the memcg charge: I hadn't thought of that.  Of course
> it's not specific to SGP_CACHE versus SGP_NOHUGE (this patch), but I
> admit that a huge mischarge is hugely worse than a small mischarge.

The small page could be collapsed to a huge page sooner or later, so
the mischarge may be transient. But a huge page can't be replaced.

>
> We could fix it by making shmem_getpage_gfp() non-static, and pointing
> to the vma (hence its mm, hence its memcg) here, couldn't we?  Easily
> done, but I don't really want to make shmem_getpage_gfp() public just
> for this, for two reasons.
>
> One is that the huge race it just so unlikely; and a mischarge to root
> is not the end of the world, so long as it's not reproducible.  It can
> only happen on the very first page of the huge extent, and the prior

OK, if so the mischarge is not as bad as I thought in the first place.

> "Stop if extent has been truncated" check makes sure there was one
> entry in the extent at that point: so the race with hole-punch can only
> occur after we xas_unlock_irq(&xas) immediately before shmem_getpage()
> looks up the page in the tree (and I say hole-punch not truncate,
> because shmem_getpage()'s i_size check will reject when truncated).
> I don't doubt that it could happen, but stand by not optimizing against.

I agree the race is very unlikely and may not be worth optimizing
against right now, but a note or a comment may be worthwhile.
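
Something along these lines, say, just before the shmem_getpage() call
in collapse_file() - illustrative wording only:

			/*
			 * A hole-punch may race with us once xas_unlock_irq()
			 * above has dropped the tree lock; a page instantiated
			 * by shmem_getpage() below would then be charged to the
			 * wrong (root) memcg.  Considered too unlikely to be
			 * worth optimizing against.
			 */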

>
> Other reason is that doing shmem_getpage() (or shmem_getpage_gfp())
> there is unhealthy for unrelated reasons, that I cannot afford to get
> into sending patches for at this time: but some of our users found the
> worst-case latencies in collapse_file() intolerable - shmem_getpage()
> may be reading in from swap, while the locked head of the huge page
> being built is in the page cache keeping other users waiting.  So,
> I'd say there's something worse than memcg in that shmem_getpage(),
> but fixing that cannot be a part of this series.

Yeah, that is a different problem.

>
> >
> > And, another question is it seems the newly allocated huge page will
> > just be uncharged instead of being freed until
> > "khugepaged_pages_to_scan" pages are scanned. The
> > khugepaged_prealloc_page() is called to free the allocated huge page
> > before each call to khugepaged_scan_mm_slot(). But
> > khugepaged_scan_file() -> collapse_fille() -> khugepaged_alloc_page()
> > may be called multiple times in the loop in khugepaged_scan_mm_slot(),
> > so khugepaged_alloc_page() may see that page to trigger VM_BUG IIUC.
> >
> > The code is quite convoluted, I'm not sure whether I miss something or
> > not. And this problem seems very hard to trigger in real life
> > workload.
>
> Just to clarify, those two paragraphs are not about this patch, but about
> what happens to mm/khugepaged.c's newly allocated huge page, when collapse
> fails for any reason.
>
> Yes, the code is convoluted: that's because it takes very different paths
> when CONFIG_NUMA=y (when it cannot predict which node to allocate from)
> and when not NUMA (when it can allocate the huge page at a good unlocked
> moment, and carry it forward from one attempt to the next).
>
> I don't like it at all, the two paths are confusing: sometimes I wonder
> whether we should just remove the !CONFIG_NUMA path entirely; and other
> times I wonder in the other direction, whether the CONFIG_NUMA=y path
> ought to go the other way when it finds nr_node_ids is 1.  Undecided.

I suppose it is just a performance consideration to keep the
allocated huge page, but I'm not sure how much difference it would make
if we removed it (removed the !CONFIG_NUMA path), because the pcp can
cache THPs now since Mel's patch 44042b449872 ("mm/page_alloc: allow
high-order pages to be stored on the per-cpu lists").

It seems to provide a similar optimization, but in the buddy
allocator layer, so that khugepaged doesn't have to maintain its own
implementation.
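
Just to sketch what I mean (made-up helper names, not a real patch):
with order-9 pages cacheable on the pcp lists, khugepaged could simply
allocate per attempt and free on failure, rather than carrying a page
from one attempt to the next:

/* hypothetical: allocate a fresh THP for each collapse attempt */
static struct page *collapse_alloc_hpage(int node, gfp_t gfp)
{
	return alloc_pages_node(node, gfp | __GFP_COMP, HPAGE_PMD_ORDER);
}

/* ... and on any failure path just give it back (to the pcp lists) */
static void collapse_free_hpage(struct page *hpage)
{
	if (hpage)
		put_page(hpage);
}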

>
> I'm confident that if you work through the two cases (thinking about
> only one of them at once!), you'll find that the failure paths (not
> to mention the successful paths) do actually work correctly without
> leaking (well, maybe the !NUMA path can hold on to one huge page
> indefinitely, I forget, but I wouldn't count that as leaking).

IIUC the !NUMA path could hold on to one huge page indefinitely. But
I've never seen the BUG personally, so maybe you are right.

>
> Collapse failure is not uncommon and leaking huge pages gets noticed.
>
> Hugh


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 10/16] tmpfs: fcntl(fd, F_MEM_LOCK) to memlock a tmpfs file
  2021-07-30  7:55   ` Hugh Dickins
  (?)
@ 2021-08-03  1:38   ` Matthew Wilcox
  2021-08-04  9:15       ` Hugh Dickins
  -1 siblings, 1 reply; 91+ messages in thread
From: Matthew Wilcox @ 2021-08-03  1:38 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Eric W. Biederman, Alexey Gladkov,
	Chris Wilson, Matthew Auld, linux-fsdevel, linux-kernel,
	linux-api, linux-mm

On Fri, Jul 30, 2021 at 12:55:22AM -0700, Hugh Dickins wrote:
> A new uapi to lock the files on tmpfs in memory, to protect against swap
> without mapping the files. This commit introduces two new commands to
> fcntl and shmem: F_MEM_LOCK and F_MEM_UNLOCK. The locking will be
> charged against RLIMIT_MEMLOCK of uid in namespace of the caller.

It's not clear to me why this is limited to shmfs.  Would it not also
make sense for traditional filesystems, eg to force chrome's text pages
to stay in the page cache, no matter how much memory the tabs allocate?
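
Something like this, hypothetically - using the F_MEM_LOCK command from
this series, but on a regular file, which nothing supports today (the
path is purely illustrative):

	int fd = open("/opt/chrome/chrome", O_RDONLY);

	/* hypothetical: pin this file's page-cache pages in memory */
	if (fcntl(fd, F_MEM_LOCK) == -1)
		perror("F_MEM_LOCK");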

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-02 21:14         ` Yang Shi
@ 2021-08-04  8:28           ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-08-04  8:28 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Shakeel Butt, Kirill A. Shutemov,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld,
	Linux FS-devel Mailing List, Linux Kernel Mailing List,
	linux-api, Linux MM

On Mon, 2 Aug 2021, Yang Shi wrote:
> On Sat, Jul 31, 2021 at 10:22 PM Hugh Dickins <hughd@google.com> wrote:
> > On Fri, 30 Jul 2021, Yang Shi wrote:
> > > On Fri, Jul 30, 2021 at 12:42 AM Hugh Dickins <hughd@google.com> wrote:
> > > >
> > > > Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> > > > that a consistent set of checks can be applied, even when the inode is
> > > > accessed through read/write syscalls (with NULL vma) instead of mmaps
> > > > (the index argument is seldom of interest, but required by mount option
> > > > "huge=within_size").  Clean up and rearrange the checks a little.
> > > >
> > > > This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> > > > were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
> > > > still true that khugepaged's collapse_file() at that point wants a small
> > > > page, the race that might allocate it a huge page is too unlikely to be
> > > > worth optimizing against (we are there *because* there was at least one
> > > > small page in the way), and handled by a later PageTransCompound check.
> > >
> > > Yes, it seems too unlikely. But if it happens the PageTransCompound
> > > check may be not good enough since the page allocated by
> > > shmem_getpage() may be charged to wrong memcg (root memcg). And it
> > > won't be replaced by a newly allocated huge page so the wrong charge
> > > can't be undone.
> >
> > Good point on the memcg charge: I hadn't thought of that.  Of course
> > it's not specific to SGP_CACHE versus SGP_NOHUGE (this patch), but I
> > admit that a huge mischarge is hugely worse than a small mischarge.
> 
> The small page could be collapsed to a huge page sooner or later, so
> the mischarge may be transient. But huge page can't be replaced.

You're right, if all goes well, the mischarged small page could be
collapsed to a correctly charged huge page sooner or later (but all
may not go well), whereas the mischarged huge page is stuck there.
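
(For anyone following along: the charge follows the mm which
shmem_getpage_gfp() is given - roughly the logic sketched below,
simplified, not the literal mm/shmem.c lines.  khugepaged calls
shmem_getpage() with no vma, and being a kernel thread it has no mm of
its own, so the fallback ends up charging the root memcg.)

	/* simplified sketch of how the charge target is chosen */
	static struct mm_struct *shmem_charge_mm(struct vm_area_struct *vma)
	{
		return vma ? vma->vm_mm : current->mm;	/* NULL mm => root */
	}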

> 
> >
> > We could fix it by making shmem_getpage_gfp() non-static, and pointing
> > to the vma (hence its mm, hence its memcg) here, couldn't we?  Easily
> > done, but I don't really want to make shmem_getpage_gfp() public just
> > for this, for two reasons.
> >
> > One is that the huge race it just so unlikely; and a mischarge to root
> > is not the end of the world, so long as it's not reproducible.  It can
> > only happen on the very first page of the huge extent, and the prior
> 
> OK, if so the mischarge is not as bad as what I thought in the first place.
> 
> > "Stop if extent has been truncated" check makes sure there was one
> > entry in the extent at that point: so the race with hole-punch can only
> > occur after we xas_unlock_irq(&xas) immediately before shmem_getpage()
> > looks up the page in the tree (and I say hole-punch not truncate,
> > because shmem_getpage()'s i_size check will reject when truncated).
> > I don't doubt that it could happen, but stand by not optimizing against.
> 
> I agree the race is so unlikely and it may be not worth optimizing
> against it right now, but a note or a comment may be worth.

Thanks, but despite us agreeing that the race is too unlikely to be worth
optimizing against, it has nagged at me ever since you questioned it:
silly, but I can't quite be convinced by my own dismissals.

I do still want to get rid of SGP_HUGE and SGP_NOHUGE, clearing up those
huge allocation decisions remains the intention; but now think to add
SGP_NOALLOC for collapse_file() in place of SGP_NOHUGE or SGP_CACHE -
to rule out that possibility of mischarge after racing hole-punch,
no matter whether it's huge or small.  If any such race occurs,
collapse_file() should just give up.

This being the "Stupid me" SGP_READ idea, except that of course would
not work: because half the point of that block in collapse_file() is
to initialize the !Uptodate pages, whereas SGP_READ avoids doing so.

There is, of course, the danger that in fixing this unlikely mischarge,
I've got the code wrong and am introducing a bug: here's what a 17/16
would look like, though it will be better inserted early.  I got sick
of all the "if (page "s, and was glad of the opportunity to fix that
outdated "bring it back from swap" comment - swap got done above.

What do you think? Should I add this in or leave it out?

Thanks,
Hugh

--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -108,6 +108,7 @@ extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 /* Flag allocation requirements to shmem_getpage */
 enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
+	SGP_NOALLOC,	/* like SGP_READ, but do use fallocated page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
 	SGP_WRITE,	/* may exceed i_size, may allocate !Uptodate page */
 	SGP_FALLOC,	/* like SGP_WRITE, but make existing page Uptodate */
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1721,7 +1721,7 @@ static void collapse_file(struct mm_struct *mm,
 				xas_unlock_irq(&xas);
 				/* swap in or instantiate fallocated page */
 				if (shmem_getpage(mapping->host, index, &page,
-						  SGP_CACHE)) {
+						  SGP_NOALLOC)) {
 					result = SCAN_FAIL;
 					goto xa_unlocked;
 				}
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1903,26 +1903,27 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 		return error;
 	}
 
-	if (page)
+	if (page) {
 		hindex = page->index;
-	if (page && sgp == SGP_WRITE)
-		mark_page_accessed(page);
-
-	/* fallocated page? */
-	if (page && !PageUptodate(page)) {
+		if (sgp == SGP_WRITE)
+			mark_page_accessed(page);
+		if (PageUptodate(page))
+			goto out;
+		/* fallocated page */
 		if (sgp != SGP_READ)
 			goto clear;
 		unlock_page(page);
 		put_page(page);
-		page = NULL;
-		hindex = index;
 	}
-	if (page || sgp == SGP_READ)
-		goto out;
+
+	*pagep = NULL;
+	if (sgp == SGP_READ)
+		return 0;
+	if (sgp == SGP_NOALLOC)
+		return -ENOENT;
 
 	/*
-	 * Fast cache lookup did not find it:
-	 * bring it back from swap or allocate.
+	 * Fast cache lookup and swap lookup did not find it: allocate.
 	 */
 
 	if (vma && userfaultfd_missing(vma)) {

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 10/16] tmpfs: fcntl(fd, F_MEM_LOCK) to memlock a tmpfs file
  2021-08-03  1:38   ` Matthew Wilcox
@ 2021-08-04  9:15       ` Hugh Dickins
  0 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-08-04  9:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Andrew Morton, Shakeel Butt, Kirill A. Shutemov,
	Yang Shi, Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Eric W. Biederman, Alexey Gladkov,
	Chris Wilson, Matthew Auld, linux-fsdevel, linux-kernel,
	linux-api, linux-mm

On Tue, 3 Aug 2021, Matthew Wilcox wrote:
> On Fri, Jul 30, 2021 at 12:55:22AM -0700, Hugh Dickins wrote:
> > A new uapi to lock the files on tmpfs in memory, to protect against swap
> > without mapping the files. This commit introduces two new commands to
> > fcntl and shmem: F_MEM_LOCK and F_MEM_UNLOCK. The locking will be
> > charged against RLIMIT_MEMLOCK of uid in namespace of the caller.
> 
> It's not clear to me why this is limited to shmfs.  Would it not also
> make sense for traditional filesystems, eg to force chrome's text pages
> to stay in the page cache, no matter how much memory the tabs allocate?

Right: if VFS people would like this to be available for all filesystems,
that's fine by me - it's just that we have not given thought to other
filesystems, and the demand was for tmpfs, so that was where to start.
I'm more confident adding fields to shmem inode than to generic inode.

(Plus tmpfs does have a stronger claim on CAP_IPC_LOCK etc, but there's
no real reason why that cannot be extended to similar use by other FSs).

hugetlbfs and ramfs, where the files are already memlocked?  Not worth a
special case, I think: if someone uses up memlock quota on them, so be it.

It looks as if tmpfs would still want its own special case, just to
handle the FALLOC_FL_KEEP_SIZE issue (see 12/16): tmpfs has beyond-i_size
pages in memory, but accounts them evictable; whereas I doubt any storage
filesystems would be using memory for them.

To be clear: I'm not intending to extend this to other filesystems at
the moment; but happy to do so if that's the consensus.
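
For reference, the usage 10/16 has in mind is just this (a sketch: the
F_MEM_LOCK/F_MEM_UNLOCK fcntl commands are the ones proposed in this
series, not in any released uapi header):

#define _GNU_SOURCE
#include <fcntl.h>		/* F_MEM_LOCK, F_MEM_UNLOCK (proposed) */
#include <sys/mman.h>		/* memfd_create() */
#include <unistd.h>

int main(void)
{
	int fd = memfd_create("pinned", MFD_CLOEXEC);

	ftruncate(fd, 64 << 20);		/* give the tmpfs file a size */
	if (fcntl(fd, F_MEM_LOCK) == -1)	/* charged to RLIMIT_MEMLOCK */
		return 1;
	/* ... contents now stay resident, protected from swap ... */
	fcntl(fd, F_MEM_UNLOCK);
	return 0;
}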

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 07/16] memfd: memfd_create(name, MFD_HUGEPAGE) for shmem huge pages
  2021-07-30  7:45   ` Hugh Dickins
  (?)
  (?)
@ 2021-08-04 14:03   ` Kirill A. Shutemov
  2021-08-06  3:33       ` Hugh Dickins
  -1 siblings, 1 reply; 91+ messages in thread
From: Kirill A. Shutemov @ 2021-08-04 14:03 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

On Fri, Jul 30, 2021 at 12:45:49AM -0700, Hugh Dickins wrote:
> Commit 749df87bd7be ("mm/shmem: add hugetlbfs support to memfd_create()")
> in 4.14 added the MFD_HUGETLB flag to memfd_create(), to use hugetlbfs
> pages instead of tmpfs pages: now add the MFD_HUGEPAGE flag, to use tmpfs
> Transparent Huge Pages when they can be allocated (flag named to follow
> the precedent of madvise's MADV_HUGEPAGE for THPs).

I don't like the interface. THP is supposed to be transparent, not yet
another hugetlbfs.

> /sys/kernel/mm/transparent_hugepage/shmem_enabled "always" or "force"
> already made this possible: but that is much too blunt an instrument,
> affecting all the very different kinds of files on the internal shmem
> mount, and was intended just for ease of testing hugepage loads.

I wonder if you tried "always" in production? What breaks? Maybe we can
make it work with a heuristic? This would speed up adoption.

If a tunable is needed, I would rather go with fadvise(). It would operate
on a couple of bits per struct file, and they would get translated into
VM_HUGEPAGE and VM_NOHUGEPAGE on mmap().

Later, if needed, the fadvise() implementation may be extended to track
requested ranges. But initially it can be simple.
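
Roughly this shape of thing (hypothetical flag and field names, just to
illustrate the idea; nothing like this exists yet):

/* hypothetical advice bits, set by fadvise(), kept on the struct file */
#define FILE_ADV_HUGEPAGE	0x1
#define FILE_ADV_NOHUGEPAGE	0x2

/* at mmap() time, translate file advice into the vma flags THP honours */
static void thp_apply_file_advice(struct file *file, struct vm_area_struct *vma)
{
	if (file->f_thp_advice & FILE_ADV_HUGEPAGE)	/* f_thp_advice: made up */
		vma->vm_flags |= VM_HUGEPAGE;
	if (file->f_thp_advice & FILE_ADV_NOHUGEPAGE)
		vma->vm_flags |= VM_NOHUGEPAGE;
}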

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 08/16] huge tmpfs: fcntl(fd, F_HUGEPAGE) and fcntl(fd, F_NOHUGEPAGE)
  2021-07-30  7:48   ` Hugh Dickins
  (?)
@ 2021-08-04 14:08   ` Kirill A. Shutemov
  2021-08-06  4:34       ` Hugh Dickins
  -1 siblings, 1 reply; 91+ messages in thread
From: Kirill A. Shutemov @ 2021-08-04 14:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Yang Shi,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

On Fri, Jul 30, 2021 at 12:48:33AM -0700, Hugh Dickins wrote:
> Add support for fcntl(fd, F_HUGEPAGE) and fcntl(fd, F_NOHUGEPAGE), to
> select hugeness per file: useful to override the default hugeness of the
> shmem mount, when occasionally needing to store a hugepage file in a
> smallpage mount or vice versa.

Hm. But why is the new MFD_* needed if the fcntl() can do the same?
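
That is, in userspace terms (both flags being the ones proposed in this
series):

	int fd;

	/* 07/16: ask for THP at memfd creation time */
	fd = memfd_create("shm", MFD_HUGEPAGE);

	/* 08/16: or create normally, then flip it with the fcntl */
	fd = memfd_create("shm", 0);
	fcntl(fd, F_HUGEPAGE);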

> These fcntls just specify whether or not to try for huge pages when
> allocating to the object later: F_HUGEPAGE does not touch small pages
> already allocated (though khugepaged may do so when the file is mapped
> afterwards), F_NOHUGEPAGE does not split huge pages already allocated.
> 
> Why fcntl?  Because it's already in use (for sealing) on memfds; and I'm
> anxious to keep this simple, just applying it to whole files: fallocate,
> madvise and posix_fadvise each involve a range, which would need a new
> kind of tree attached to the inode for proper support.

Most fadvise() operations ignore the range. I like fadvise() because
it's less prescriptive: the kernel is free to ignore it.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-04  8:28           ` Hugh Dickins
@ 2021-08-04 19:01             ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-08-04 19:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Wed, Aug 4, 2021 at 1:28 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Mon, 2 Aug 2021, Yang Shi wrote:
> > On Sat, Jul 31, 2021 at 10:22 PM Hugh Dickins <hughd@google.com> wrote:
> > > On Fri, 30 Jul 2021, Yang Shi wrote:
> > > > On Fri, Jul 30, 2021 at 12:42 AM Hugh Dickins <hughd@google.com> wrote:
> > > > >
> > > > > Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> > > > > that a consistent set of checks can be applied, even when the inode is
> > > > > accessed through read/write syscalls (with NULL vma) instead of mmaps
> > > > > (the index argument is seldom of interest, but required by mount option
> > > > > "huge=within_size").  Clean up and rearrange the checks a little.
> > > > >
> > > > > This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> > > > > were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
> > > > > still true that khugepaged's collapse_file() at that point wants a small
> > > > > page, the race that might allocate it a huge page is too unlikely to be
> > > > > worth optimizing against (we are there *because* there was at least one
> > > > > small page in the way), and handled by a later PageTransCompound check.
> > > >
> > > > Yes, it seems too unlikely. But if it happens the PageTransCompound
> > > > check may be not good enough since the page allocated by
> > > > shmem_getpage() may be charged to wrong memcg (root memcg). And it
> > > > won't be replaced by a newly allocated huge page so the wrong charge
> > > > can't be undone.
> > >
> > > Good point on the memcg charge: I hadn't thought of that.  Of course
> > > it's not specific to SGP_CACHE versus SGP_NOHUGE (this patch), but I
> > > admit that a huge mischarge is hugely worse than a small mischarge.
> >
> > The small page could be collapsed to a huge page sooner or later, so
> > the mischarge may be transient. But huge page can't be replaced.
>
> You're right, if all goes well, the mischarged small page could be
> collapsed to a correctly charged huge page sooner or later (but all
> may not go well), whereas the mischarged huge page is stuck there.
>
> >
> > >
> > > We could fix it by making shmem_getpage_gfp() non-static, and pointing
> > > to the vma (hence its mm, hence its memcg) here, couldn't we?  Easily
> > > done, but I don't really want to make shmem_getpage_gfp() public just
> > > for this, for two reasons.
> > >
> > > One is that the huge race it just so unlikely; and a mischarge to root
> > > is not the end of the world, so long as it's not reproducible.  It can
> > > only happen on the very first page of the huge extent, and the prior
> >
> > OK, if so the mischarge is not as bad as what I thought in the first place.
> >
> > > "Stop if extent has been truncated" check makes sure there was one
> > > entry in the extent at that point: so the race with hole-punch can only
> > > occur after we xas_unlock_irq(&xas) immediately before shmem_getpage()
> > > looks up the page in the tree (and I say hole-punch not truncate,
> > > because shmem_getpage()'s i_size check will reject when truncated).
> > > I don't doubt that it could happen, but stand by not optimizing against.
> >
> > I agree the race is so unlikely and it may be not worth optimizing
> > against it right now, but a note or a comment may be worth.
>
> Thanks, but despite us agreeing that the race is too unlikely to be worth
> optimizing against, it does still nag at me ever since you questioned it:
> silly, but I can't quite be convinced by my own dismissals.
>
> I do still want to get rid of SGP_HUGE and SGP_NOHUGE, clearing up those
> huge allocation decisions remains the intention; but now think to add
> SGP_NOALLOC for collapse_file() in place of SGP_NOHUGE or SGP_CACHE -
> to rule out that possibility of mischarge after racing hole-punch,
> no matter whether it's huge or small.  If any such race occurs,
> collapse_file() should just give up.
>
> This being the "Stupid me" SGP_READ idea, except that of course would
> not work: because half the point of that block in collapse_file() is
> to initialize the !Uptodate pages, whereas SGP_READ avoids doing so.
>
> There is, of course, the danger that in fixing this unlikely mischarge,
> I've got the code wrong and am introducing a bug: here's what a 17/16
> would look like, though it will be better inserted early.  I got sick
> of all the "if (page "s, and was glad of the opportunity to fix that
> outdated "bring it back from swap" comment - swap got done above.
>
> What do you think? Should I add this in or leave it out?

Thanks for continuing to investigate this. The patch looks good to me. I
think we could go this way. Just a nit below.

>
> Thanks,
> Hugh
>
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -108,6 +108,7 @@ extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
>  /* Flag allocation requirements to shmem_getpage */
>  enum sgp_type {
>         SGP_READ,       /* don't exceed i_size, don't allocate page */
> +       SGP_NOALLOC,    /* like SGP_READ, but do use fallocated page */

The comment looks misleading: it seems SGP_NOALLOC does clear
(initialize) the !Uptodate page but SGP_READ doesn't. Or is it fine not
to distinguish this difference?
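
Maybe just reword it, something like (suggestion only):

	SGP_NOALLOC,	/* similar, but fail on hole or beyond i_size */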

>         SGP_CACHE,      /* don't exceed i_size, may allocate page */
>         SGP_WRITE,      /* may exceed i_size, may allocate !Uptodate page */
>         SGP_FALLOC,     /* like SGP_WRITE, but make existing page Uptodate */
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1721,7 +1721,7 @@ static void collapse_file(struct mm_struct *mm,
>                                 xas_unlock_irq(&xas);
>                                 /* swap in or instantiate fallocated page */
>                                 if (shmem_getpage(mapping->host, index, &page,
> -                                                 SGP_CACHE)) {
> +                                                 SGP_NOALLOC)) {
>                                         result = SCAN_FAIL;
>                                         goto xa_unlocked;
>                                 }
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1903,26 +1903,27 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>                 return error;
>         }
>
> -       if (page)
> +       if (page) {
>                 hindex = page->index;
> -       if (page && sgp == SGP_WRITE)
> -               mark_page_accessed(page);
> -
> -       /* fallocated page? */
> -       if (page && !PageUptodate(page)) {
> +               if (sgp == SGP_WRITE)
> +                       mark_page_accessed(page);
> +               if (PageUptodate(page))
> +                       goto out;
> +               /* fallocated page */
>                 if (sgp != SGP_READ)
>                         goto clear;
>                 unlock_page(page);
>                 put_page(page);
> -               page = NULL;
> -               hindex = index;
>         }
> -       if (page || sgp == SGP_READ)
> -               goto out;
> +
> +       *pagep = NULL;
> +       if (sgp == SGP_READ)
> +               return 0;
> +       if (sgp == SGP_NOALLOC)
> +               return -ENOENT;
>
>         /*
> -        * Fast cache lookup did not find it:
> -        * bring it back from swap or allocate.
> +        * Fast cache lookup and swap lookup did not find it: allocate.
>          */
>
>         if (vma && userfaultfd_missing(vma)) {

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-02 21:14         ` Yang Shi
@ 2021-08-05 23:04           ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-08-05 23:04 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Mon, Aug 2, 2021 at 2:14 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Sat, Jul 31, 2021 at 10:22 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > On Fri, 30 Jul 2021, Yang Shi wrote:
> > > On Fri, Jul 30, 2021 at 12:42 AM Hugh Dickins <hughd@google.com> wrote:
> > > >
> > > > Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> > > > that a consistent set of checks can be applied, even when the inode is
> > > > accessed through read/write syscalls (with NULL vma) instead of mmaps
> > > > (the index argument is seldom of interest, but required by mount option
> > > > "huge=within_size").  Clean up and rearrange the checks a little.
> > > >
> > > > This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> > > > were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
> > > > still true that khugepaged's collapse_file() at that point wants a small
> > > > page, the race that might allocate it a huge page is too unlikely to be
> > > > worth optimizing against (we are there *because* there was at least one
> > > > small page in the way), and handled by a later PageTransCompound check.
> > >
> > > Yes, it seems too unlikely. But if it happens the PageTransCompound
> > > check may be not good enough since the page allocated by
> > > shmem_getpage() may be charged to wrong memcg (root memcg). And it
> > > won't be replaced by a newly allocated huge page so the wrong charge
> > > can't be undone.
> >
> > Good point on the memcg charge: I hadn't thought of that.  Of course
> > it's not specific to SGP_CACHE versus SGP_NOHUGE (this patch), but I
> > admit that a huge mischarge is hugely worse than a small mischarge.
>
> The small page could be collapsed to a huge page sooner or later, so
> the mischarge may be transient. But huge page can't be replaced.
>
> >
> > We could fix it by making shmem_getpage_gfp() non-static, and pointing
> > to the vma (hence its mm, hence its memcg) here, couldn't we?  Easily
> > done, but I don't really want to make shmem_getpage_gfp() public just
> > for this, for two reasons.
> >
> > One is that the huge race it just so unlikely; and a mischarge to root
> > is not the end of the world, so long as it's not reproducible.  It can
> > only happen on the very first page of the huge extent, and the prior
>
> OK, if so the mischarge is not as bad as what I thought in the first place.
>
> > "Stop if extent has been truncated" check makes sure there was one
> > entry in the extent at that point: so the race with hole-punch can only
> > occur after we xas_unlock_irq(&xas) immediately before shmem_getpage()
> > looks up the page in the tree (and I say hole-punch not truncate,
> > because shmem_getpage()'s i_size check will reject when truncated).
> > I don't doubt that it could happen, but stand by not optimizing against.
>
> I agree the race is so unlikely and it may be not worth optimizing
> against it right now, but a note or a comment may be worth.
>
> >
> > Other reason is that doing shmem_getpage() (or shmem_getpage_gfp())
> > there is unhealthy for unrelated reasons, that I cannot afford to get
> > into sending patches for at this time: but some of our users found the
> > worst-case latencies in collapse_file() intolerable - shmem_getpage()
> > may be reading in from swap, while the locked head of the huge page
> > being built is in the page cache keeping other users waiting.  So,
> > I'd say there's something worse than memcg in that shmem_getpage(),
> > but fixing that cannot be a part of this series.
>
> Yeah, that is a different problem.
>
> >
> > >
> > > And, another question is it seems the newly allocated huge page will
> > > just be uncharged instead of being freed until
> > > "khugepaged_pages_to_scan" pages are scanned. The
> > > khugepaged_prealloc_page() is called to free the allocated huge page
> > > before each call to khugepaged_scan_mm_slot(). But
> > > khugepaged_scan_file() -> collapse_fille() -> khugepaged_alloc_page()
> > > may be called multiple times in the loop in khugepaged_scan_mm_slot(),
> > > so khugepaged_alloc_page() may see that page to trigger VM_BUG IIUC.
> > >
> > > The code is quite convoluted, I'm not sure whether I miss something or
> > > not. And this problem seems very hard to trigger in real life
> > > workload.
> >
> > Just to clarify, those two paragraphs are not about this patch, but about
> > what happens to mm/khugepaged.c's newly allocated huge page, when collapse
> > fails for any reason.
> >
> > Yes, the code is convoluted: that's because it takes very different paths
> > when CONFIG_NUMA=y (when it cannot predict which node to allocate from)
> > and when not NUMA (when it can allocate the huge page at a good unlocked
> > moment, and carry it forward from one attempt to the next).
> >
> > I don't like it at all, the two paths are confusing: sometimes I wonder
> > whether we should just remove the !CONFIG_NUMA path entirely; and other
> > times I wonder in the other direction, whether the CONFIG_NUMA=y path
> > ought to go the other way when it finds nr_node_ids is 1.  Undecided.
>
> I'm supposed it is just performance consideration to keep the
> allocated huge page, but I'm not sure how much the difference would be
> if we remove it (remove the !CONFIG_NUMA path) because the pcp could
> cache THP now since Mel's patch 44042b449872 ("mm/page_alloc: allow
> high-order pages to be stored on the per-cpu lists").
>
> It seems to provide the similar optimization but in the buddy
> allocator layer so that khugepaged doesn't have to maintain its own
> implementation.
>
> >
> > I'm confident that if you work through the two cases (thinking about
> > only one of them at once!), you'll find that the failure paths (not
> > to mention the successful paths) do actually work correctly without
> > leaking (well, maybe the !NUMA path can hold on to one huge page
> > indefinitely, I forget, but I wouldn't count that as leaking).
>
> IIUC the NUMA page could hold on to one huge page indefinitely. But
> I've never seen the BUG personally, so maybe you are right.

By rereading the code, I think you are correct. Both cases do work
correctly without leaking. And the !CONFIG_NUMA case may carry the
huge page indefinitely.

I think it is because khugepaged may collapse memory for another NUMA
node in the next loop, so it doesn't make too much sense to carry the
huge page, but it may be an optimization for the !CONFIG_NUMA case.

However, as I mentioned in an earlier email, the new pcp implementation
can cache THPs now, so we might not need to keep this convoluted logic
anymore. Just free the page if collapse fails, then re-allocate a THP.
The carried THP might improve the success rate a little, but I doubt
it would be very noticeable; it may not be worth the extra complexity
at all.

>
> >
> > Collapse failure is not uncommon and leaking huge pages gets noticed.
> >
> > Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
@ 2021-08-05 23:04           ` Yang Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-08-05 23:04 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Mon, Aug 2, 2021 at 2:14 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Sat, Jul 31, 2021 at 10:22 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > On Fri, 30 Jul 2021, Yang Shi wrote:
> > > On Fri, Jul 30, 2021 at 12:42 AM Hugh Dickins <hughd@google.com> wrote:
> > > >
> > > > Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so
> > > > that a consistent set of checks can be applied, even when the inode is
> > > > accessed through read/write syscalls (with NULL vma) instead of mmaps
> > > > (the index argument is seldom of interest, but required by mount option
> > > > "huge=within_size").  Clean up and rearrange the checks a little.
> > > >
> > > > This then replaces the checks which shmem_fault() and shmem_getpage_gfp()
> > > > were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes: while it's
> > > > still true that khugepaged's collapse_file() at that point wants a small
> > > > page, the race that might allocate it a huge page is too unlikely to be
> > > > worth optimizing against (we are there *because* there was at least one
> > > > small page in the way), and handled by a later PageTransCompound check.
> > >
> > > Yes, it seems too unlikely. But if it happens the PageTransCompound
> > > check may be not good enough since the page allocated by
> > > shmem_getpage() may be charged to wrong memcg (root memcg). And it
> > > won't be replaced by a newly allocated huge page so the wrong charge
> > > can't be undone.
> >
> > Good point on the memcg charge: I hadn't thought of that.  Of course
> > it's not specific to SGP_CACHE versus SGP_NOHUGE (this patch), but I
> > admit that a huge mischarge is hugely worse than a small mischarge.
>
> The small page could be collapsed to a huge page sooner or later, so
> the mischarge may be transient. But huge page can't be replaced.
>
> >
> > We could fix it by making shmem_getpage_gfp() non-static, and pointing
> > to the vma (hence its mm, hence its memcg) here, couldn't we?  Easily
> > done, but I don't really want to make shmem_getpage_gfp() public just
> > for this, for two reasons.
> >
> > One is that the huge race it just so unlikely; and a mischarge to root
> > is not the end of the world, so long as it's not reproducible.  It can
> > only happen on the very first page of the huge extent, and the prior
>
> OK, if so the mischarge is not as bad as what I thought in the first place.
>
> > "Stop if extent has been truncated" check makes sure there was one
> > entry in the extent at that point: so the race with hole-punch can only
> > occur after we xas_unlock_irq(&xas) immediately before shmem_getpage()
> > looks up the page in the tree (and I say hole-punch not truncate,
> > because shmem_getpage()'s i_size check will reject when truncated).
> > I don't doubt that it could happen, but stand by not optimizing against.
>
> I agree the race is so unlikely and it may be not worth optimizing
> against it right now, but a note or a comment may be worth.
>
> >
> > Other reason is that doing shmem_getpage() (or shmem_getpage_gfp())
> > there is unhealthy for unrelated reasons, that I cannot afford to get
> > into sending patches for at this time: but some of our users found the
> > worst-case latencies in collapse_file() intolerable - shmem_getpage()
> > may be reading in from swap, while the locked head of the huge page
> > being built is in the page cache keeping other users waiting.  So,
> > I'd say there's something worse than memcg in that shmem_getpage(),
> > but fixing that cannot be a part of this series.
>
> Yeah, that is a different problem.
>
> >
> > >
> > > And, another question: it seems the newly allocated huge page will
> > > just be uncharged instead of being freed until
> > > "khugepaged_pages_to_scan" pages are scanned.
> > > khugepaged_prealloc_page() is called to free the allocated huge page
> > > before each call to khugepaged_scan_mm_slot(). But
> > > khugepaged_scan_file() -> collapse_file() -> khugepaged_alloc_page()
> > > may be called multiple times in the loop in khugepaged_scan_mm_slot(),
> > > so khugepaged_alloc_page() may see that page and trigger the VM_BUG_ON IIUC.
> > >
> > > The code is quite convoluted; I'm not sure whether I'm missing something
> > > or not. And this problem seems very hard to trigger in a real-life
> > > workload.
> >
> > Just to clarify, those two paragraphs are not about this patch, but about
> > what happens to mm/khugepaged.c's newly allocated huge page, when collapse
> > fails for any reason.
> >
> > Yes, the code is convoluted: that's because it takes very different paths
> > when CONFIG_NUMA=y (when it cannot predict which node to allocate from)
> > and when not NUMA (when it can allocate the huge page at a good unlocked
> > moment, and carry it forward from one attempt to the next).
> >
> > I don't like it at all, the two paths are confusing: sometimes I wonder
> > whether we should just remove the !CONFIG_NUMA path entirely; and other
> > times I wonder in the other direction, whether the CONFIG_NUMA=y path
> > ought to go the other way when it finds nr_node_ids is 1.  Undecided.
>
> I suppose it is just a performance consideration to keep the
> allocated huge page, but I'm not sure how much difference it would
> make if we removed it (removed the !CONFIG_NUMA path), because the
> pcp lists can cache THPs now since Mel's patch 44042b449872
> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
> lists").
>
> It seems to provide a similar optimization, but in the buddy
> allocator layer, so that khugepaged doesn't have to maintain its own
> implementation.
>
> >
> > I'm confident that if you work through the two cases (thinking about
> > only one of them at once!), you'll find that the failure paths (not
> > to mention the successful paths) do actually work correctly without
> > leaking (well, maybe the !NUMA path can hold on to one huge page
> > indefinitely, I forget, but I wouldn't count that as leaking).
>
> IIUC the !NUMA path could hold on to one huge page indefinitely. But
> I've never seen the BUG personally, so maybe you are right.

By rereading the code, I think you are correct. Both cases do work
correctly without leaking. And the !CONFIG_NUMA case may carry the
huge page indefinitely.

I think it is because khugepaged may collapse memory for another NUMA
node in the next loop, so it doesn't make too much sense to carry the
huge page, but it may be an optimization for !CONFIG_NUMA case.

However, as I mentioned in an earlier email, the new pcp implementation
can cache THPs now, so we might not need to keep this convoluted logic
anymore. Just free the page if collapse fails, then re-allocate the
THP. The carried THP might improve the success rate a little bit, but I
doubt how noticeable it would be; it may not be worth the extra
complexity at all.

>
> >
> > Collapse failure is not uncommon and leaking huge pages gets noticed.
> >
> > Hugh


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 07/16] memfd: memfd_create(name, MFD_HUGEPAGE) for shmem huge pages
  2021-08-04 14:03   ` Kirill A. Shutemov
@ 2021-08-06  3:33       ` Hugh Dickins
  0 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-08-06  3:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Shakeel Butt, Kirill A. Shutemov,
	Yang Shi, Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

On Wed, 4 Aug 2021, Kirill A. Shutemov wrote:
> On Fri, Jul 30, 2021 at 12:45:49AM -0700, Hugh Dickins wrote:
> > Commit 749df87bd7be ("mm/shmem: add hugetlbfs support to memfd_create()")
> > in 4.14 added the MFD_HUGETLB flag to memfd_create(), to use hugetlbfs
> > pages instead of tmpfs pages: now add the MFD_HUGEPAGE flag, to use tmpfs
> > Transparent Huge Pages when they can be allocated (flag named to follow
> > the precedent of madvise's MADV_HUGEPAGE for THPs).
> 
> I don't like the interface. THP is supposed to be transparent, not yet
> another hugetlbfs.

THP is transparent in the sense that it builds hugepages from the
normal page pool, when it can (or not when it cannot), rather than
promising hugepages from a separate pre-reserved hugetlbfs pool.

Not transparent in the sense that it cannot be limited or guided.

> 
> > /sys/kernel/mm/transparent_hugepage/shmem_enabled "always" or "force"
> > already made this possible: but that is much too blunt an instrument,
> > affecting all the very different kinds of files on the internal shmem
> > mount, and was intended just for ease of testing hugepage loads.
> 
> I wonder if you tried "always" in production? What breaks? Maybe we can
> make it work with a heuristic? This would speed up adoption.

We have not tried /sys/kernel/mm/transparent_hugepage/shmem_enabled
"always" in production.  Is that an experiment I want to recommend for
production?  No, I don't think so!  Why should we?

I am not looking to "speed up adoption" of huge tmpfs everywhere:
let those who find it useful use it, there is no need for it to be
used everywhere.

We have had this disagreement before: you were aiming for tmpfs on /tmp
huge=always, I didn't see the need for that; but we have always agreed
that it should not be broken there, and the better it works the better -
you did the unused_huge_shrink stuff in particular to meet such cases.

> 
> If a tunable is needed, I would rather go with fadvise(). It would operate
> on a couple of bits per struct file, and they would get translated into
> VM_HUGEPAGE and VM_NOHUGEPAGE on mmap().
> 
> Later, if needed, the fadvise() implementation may be extended to track
> requested ranges. But initially it can be simple.

Let me shift that to the 08/16 (fcntl) response, and here answer:

> Hm, but why is the MFD_* needed if the fcntl() can do the same?

You're right, MFD_HUGEPAGE (and MFD_MEM_LOCK) are not strictly
needed if there's an fcntl() or fadvise() which can do that too.

But MFD_HUGEPAGE is the option which was first asked for, and is
the most popular usage internally - I did the fcntl at the same time,
and it has been found useful, but MFD_HUGEPAGE was the priority
(largely because fiddling with shmem_enabled interferes with
everyone's different usages, whereas huge=always on a mount
can be deployed selectively).

And it makes good sense for memfd_create() to offer MFD_HUGEPAGE,
as it is already offering MFD_HUGETLB: when we document MFD_HUGEPAGE
next to MFD_HUGETLB in the memfd_create(2) man page, that will help
developers to make a good choice.
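
For illustration only, usage would look something like the sketch
below - a minimal sketch, assuming a glibc new enough to provide the
memfd_create() wrapper; and since MFD_HUGEPAGE is only what this series
proposes, it is not in mainline <linux/memfd.h> yet, so the value
defined here is a placeholder, not necessarily what would be merged:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <linux/memfd.h>

#ifndef MFD_HUGEPAGE
#define MFD_HUGEPAGE 0x0008U        /* placeholder value for this sketch */
#endif

int create_thp_memfd(size_t size)
{
        int fd = memfd_create("thp-buffer", MFD_CLOEXEC | MFD_HUGEPAGE);

        if (fd < 0)
                return -1;
        if (ftruncate(fd, size) < 0) {  /* size it before anyone maps it */
                close(fd);
                return -1;
        }
        return fd;      /* mappings of this fd may now be backed by THPs */
}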

(You said MFD_*, so I take it that you're thinking of MFD_MEM_LOCK
too: MFD_MEM_LOCK is something I added when building this series,
when I realized that it became possible once size change permitted.
Nobody here is using it yet, I don't mind if it's dropped; but it's
natural to propose it as part of the series, and it can be justified
as offering the memlock option which MFD_HUGETLB already bundles in.)

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 08/16] huge tmpfs: fcntl(fd, F_HUGEPAGE) and fcntl(fd, F_NOHUGEPAGE)
  2021-08-04 14:08   ` Kirill A. Shutemov
@ 2021-08-06  4:34       ` Hugh Dickins
  0 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-08-06  4:34 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrew Morton, Shakeel Butt, Kirill A. Shutemov,
	Yang Shi, Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld, linux-fsdevel,
	linux-kernel, linux-api, linux-mm

On Wed, 4 Aug 2021, Kirill A. Shutemov wrote:
> On Fri, Jul 30, 2021 at 12:48:33AM -0700, Hugh Dickins wrote:
> > Add support for fcntl(fd, F_HUGEPAGE) and fcntl(fd, F_NOHUGEPAGE), to
> > select hugeness per file: useful to override the default hugeness of the
> > shmem mount, when occasionally needing to store a hugepage file in a
> > smallpage mount or vice versa.
> 
> Hm. But why is the new MFD_* needed if the fcntl() can do the same?

That I've just addressed in the MFD_HUGEPAGE 07/16 thread.

> 
> > These fcntls just specify whether or not to try for huge pages when
> > allocating to the object later: F_HUGEPAGE does not touch small pages
> > already allocated (though khugepaged may do so when the file is mapped
> > afterwards), F_NOHUGEPAGE does not split huge pages already allocated.
> > 
> > Why fcntl?  Because it's already in use (for sealing) on memfds; and I'm
> > anxious to keep this simple, just applying it to whole files: fallocate,
> > madvise and posix_fadvise each involve a range, which would need a new
> > kind of tree attached to the inode for proper support.
> 
> Most fadvise() operations ignore the range. I like fadvise() because
> it's less prescriptive: the kernel is free to ignore it.

As to ignoring the range, yes, I see now that some do; and I'm relieved
to see "Len == 0 means as much as possible", that's great, I was afraid
of compat bugs over 0xffy numbers for the len.  And we would want, not
to ignore the range, but insist on offset 0, len 0 for now, if there's
any intention (not mine) of extending it to ranges in the future.

As to ignoring the prescription, that's just a matter of how we describe
it in the manpage, no matter whether it's fadvise() or fcntl().

And in the 07/16 thread you also said:

> 
> If a tunable is needed, I would rather go with fadvise(). It would operate
> on a couple of bits per struct file, and they would get translated into
> VM_HUGEPAGE and VM_NOHUGEPAGE on mmap().

Not so sure about that detail: the point here is to decide what kind
of allocations to try for, before the file is mmap()ed; and it is the
file (the underlying object) that I want to condition here, rather than
the struct file of who has it open at the time, or their mmap()s.
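
To make that concrete: a minimal sketch, assuming the F_HUGEPAGE and
F_NOHUGEPAGE commands proposed here (the numeric values below are
placeholders for the sketch, not what include/uapi/linux/fcntl.h would
actually assign), with the fd conditioned before anyone mmap()s the file:

#include <fcntl.h>
#include <stdio.h>

#ifndef F_HUGEPAGE
#define F_HUGEPAGE      (1024 + 15)     /* placeholder for this sketch */
#define F_NOHUGEPAGE    (1024 + 16)     /* placeholder for this sketch */
#endif

/* Ask the underlying tmpfs object to try THPs for future allocations. */
static int prefer_huge(int fd)
{
        if (fcntl(fd, F_HUGEPAGE) < 0) {
                perror("fcntl(F_HUGEPAGE)");
                return -1;
        }
        return 0;
}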

But adding the flags into the vm_flags on mmap(): that's an interesting
idea, I haven't played with that at all.  Offhand, I don't think it will
give different allocation results from what I'm already doing, but might
affect what is shown by default in /proc/<pid>/smaps.

> 
> Later, if needed, the fadvise() implementation may be extended to track
> requested ranges. But initially it can be simple.

I still prefer fcntl() myself, but we can go with either: what I'd
like to hear is the preference of linux-fsdevel and linux-api people.

Aside from the unused offset+len, my main problem with fadvise()
is that... it doesn't exist.  It's posix_fadvise() or fadvise64() or
fadvise64_64(), and all its good advices are POSIX_FADV_whatever.

Are we comfortable now adding LINUX_MADV_HUGEPAGE, LINUX_MADV_NOHUGEPAGE?

I find myself singing 64 64 Zoo Lane.

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-04 19:01             ` Yang Shi
@ 2021-08-06  5:21               ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-08-06  5:21 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Shakeel Butt, Kirill A. Shutemov,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld,
	Linux FS-devel Mailing List, Linux Kernel Mailing List,
	linux-api, Linux MM

On Wed, 4 Aug 2021, Yang Shi wrote:
> On Wed, Aug 4, 2021 at 1:28 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > Thanks, but despite us agreeing that the race is too unlikely to be worth
> > optimizing against, it does still nag at me ever since you questioned it:
> > silly, but I can't quite be convinced by my own dismissals.
> >
> > I do still want to get rid of SGP_HUGE and SGP_NOHUGE, clearing up those
> > huge allocation decisions remains the intention; but now think to add
> > SGP_NOALLOC for collapse_file() in place of SGP_NOHUGE or SGP_CACHE -
> > to rule out that possibility of mischarge after racing hole-punch,
> > no matter whether it's huge or small.  If any such race occurs,
> > collapse_file() should just give up.
> >
> > This being the "Stupid me" SGP_READ idea, except that of course would
> > not work: because half the point of that block in collapse_file() is
> > to initialize the !Uptodate pages, whereas SGP_READ avoids doing so.
> >
> > There is, of course, the danger that in fixing this unlikely mischarge,
> > I've got the code wrong and am introducing a bug: here's what a 17/16
> > would look like, though it will be better inserted early.  I got sick
> > of all the "if (page "s, and was glad of the opportunity to fix that
> > outdated "bring it back from swap" comment - swap got done above.
> >
> > What do you think? Should I add this in or leave it out?
> 
> Thanks for keeping investigating this. The patch looks good to me. I
> think we could go this way. Just a nit below.

Thanks, I'll add it into the series, a patch before SGP_NOHUGE goes away;
but I'm not intending to respin the series until there's more feedback
from others - fcntl versus fadvise is the main issue so far.

> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
> > @@ -108,6 +108,7 @@ extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
> >  /* Flag allocation requirements to shmem_getpage */
> >  enum sgp_type {
> >         SGP_READ,       /* don't exceed i_size, don't allocate page */
> > +       SGP_NOALLOC,    /* like SGP_READ, but do use fallocated page */
> 
> The comment looks misleading; it seems SGP_NOALLOC does clear the
> Uptodate flag but SGP_READ doesn't. Or is it fine not to distinguish
> this difference?

I think you meant to say, SGP_NOALLOC does *set* the Uptodate flag but
SGP_READ doesn't.  And a more significant difference, as coded to suit
collapse_file(), is that SGP_NOALLOC returns failure on hole, whereas
SGP_READ returns success: I should have mentioned that.
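
A minimal sketch of how collapse_file() would then use it (just the
shape of the caller, not claiming to be the exact 17/16 hunk):

        if (shmem_getpage(mapping->host, index, &page, SGP_NOALLOC)) {
                /*
                 * Hole (e.g. after a racing hole-punch) or error:
                 * give up on the collapse rather than risk charging
                 * a freshly allocated page to the wrong memcg.
                 */
                result = SCAN_FAIL;
                goto xa_unlocked;
        }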

When I wrote "like SGP_READ" there, I just meant "like what's said in
the line above": would "ditto" be okay with you, and I say
	SGP_NOALLOC,	/* ditto, but fail on hole, or use fallocated page */

I don't really want to get into the "Uptodate" business there.
And I'm afraid someone is going to ask me to write multi-line comments
on each of those SGP_flags, and I'm going to plead "read the source"!

Oh, now I see why you said SGP_NOALLOC does clear the Uptodate flag:
"goto clear", haha: when we clear the page we set the Uptodate flag.

And I may have another patch to slot in: I was half expecting you to
question why SGP_READ behaves as it does, so in preparing its defence
I checked, and found it was not doing quite what I remembered: changes
were made a long time ago, which have left it slightly suboptimal.
But that really has nothing to do with the rest of this series,
and I don't need to run it past you before reposting.

I hope that some of the features in this series can be useful to you.

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-05 23:04           ` Yang Shi
@ 2021-08-06  5:43             ` Hugh Dickins
  -1 siblings, 0 replies; 91+ messages in thread
From: Hugh Dickins @ 2021-08-06  5:43 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Shakeel Butt, Kirill A. Shutemov,
	Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
	Christoph Hellwig, Matthew Wilcox, Eric W. Biederman,
	Alexey Gladkov, Chris Wilson, Matthew Auld,
	Linux FS-devel Mailing List, Linux Kernel Mailing List,
	linux-api, Linux MM

On Thu, 5 Aug 2021, Yang Shi wrote:
> 
> By rereading the code, I think you are correct. Both cases do work
> correctly without leaking. And the !CONFIG_NUMA case may carry the
> huge page indefinitely.
> 
> I think it is because khugepaged may collapse memory for another NUMA
> node in the next loop, so it doesn't make too much sense to carry the
> huge page, but it may be an optimization for !CONFIG_NUMA case.

Yes, that is its intention.

> 
> However, as I mentioned in an earlier email, the new pcp implementation
> can cache THPs now, so we might not need to keep this convoluted logic
> anymore. Just free the page if collapse fails, then re-allocate the
> THP. The carried THP might improve the success rate a little bit, but I
> doubt how noticeable it would be; it may not be worth the extra
> complexity at all.

It would be great if the new pcp implementation is good enough to
get rid of khugepaged's confusing NUMA=y/NUMA=n differences; and all
the *hpage stuff too, I hope.  That would be a welcome cleanup.

> > > Collapse failure is not uncommon and leaking huge pages gets noticed.

After writing that, I realized how I'm almost always testing a NUMA=y
kernel (though on non-NUMA machines), and seldom try the NUMA=n build.
So did so to check no leak, indeed; but was surprised, when comparing
vmstats, that the NUMA=n run had done 5 times as much thp_collapse_alloc
as the NUMA=y run.  I've merely made a note to look into that one day:
maybe it was just a one-off oddity, or maybe the incrementing of stats
is wrong down one path or the other.

Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-06  5:21               ` Hugh Dickins
@ 2021-08-06 17:41                 ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-08-06 17:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Thu, Aug 5, 2021 at 10:21 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Wed, 4 Aug 2021, Yang Shi wrote:
> > On Wed, Aug 4, 2021 at 1:28 AM Hugh Dickins <hughd@google.com> wrote:
> > >
> > > Thanks, but despite us agreeing that the race is too unlikely to be worth
> > > optimizing against, it does still nag at me ever since you questioned it:
> > > silly, but I can't quite be convinced by my own dismissals.
> > >
> > > I do still want to get rid of SGP_HUGE and SGP_NOHUGE, clearing up those
> > > huge allocation decisions remains the intention; but now think to add
> > > SGP_NOALLOC for collapse_file() in place of SGP_NOHUGE or SGP_CACHE -
> > > to rule out that possibility of mischarge after racing hole-punch,
> > > no matter whether it's huge or small.  If any such race occurs,
> > > collapse_file() should just give up.
> > >
> > > This being the "Stupid me" SGP_READ idea, except that of course would
> > > not work: because half the point of that block in collapse_file() is
> > > to initialize the !Uptodate pages, whereas SGP_READ avoids doing so.
> > >
> > > There is, of course, the danger that in fixing this unlikely mischarge,
> > > I've got the code wrong and am introducing a bug: here's what a 17/16
> > > would look like, though it will be better inserted early.  I got sick
> > > of all the "if (page "s, and was glad of the opportunity to fix that
> > > outdated "bring it back from swap" comment - swap got done above.
> > >
> > > What do you think? Should I add this in or leave it out?
> >
> > Thanks for keeping investigating this. The patch looks good to me. I
> > think we could go this way. Just a nit below.
>
> Thanks, I'll add it into the series, a patch before SGP_NOHUGE goes away;
> but I'm not intending to respin the series until there's more feedback
> from others - fcntl versus fadvise is the main issue so far.

Thanks, yeah, no hurry to repost.

>
> > > --- a/include/linux/shmem_fs.h
> > > +++ b/include/linux/shmem_fs.h
> > > @@ -108,6 +108,7 @@ extern unsigned long shmem_partial_swap_usage(struct address_space *mapping,
> > >  /* Flag allocation requirements to shmem_getpage */
> > >  enum sgp_type {
> > >         SGP_READ,       /* don't exceed i_size, don't allocate page */
> > > +       SGP_NOALLOC,    /* like SGP_READ, but do use fallocated page */
> >
> > The comment looks misleading; it seems SGP_NOALLOC does clear the
> > Uptodate flag but SGP_READ doesn't. Or is it fine not to distinguish
> > this difference?
>
> I think you meant to say, SGP_NOALLOC does *set* the Uptodate flag but
> SGP_READ doesn't.  And a more significant difference, as coded to suit
> collapse_file(), is that SGP_NOALLOC returns failure on hole, whereas
> SGP_READ returns success: I should have mentioned that.

Yes, I mean "set". Sorry for the confusion.

>
> When I wrote "like SGP_READ" there, I just meant "like what's said in
> the line above": would "ditto" be okay with you, and I say
>         SGP_NOALLOC,    /* ditto, but fail on hole, or use fallocated page */
>
> I don't really want to get into the "Uptodate" business there.
> And I'm afraid someone is going to ask me to write multi-line comments
> on each of those SGP_flags, and I'm going to plead "read the source"!

OK, I'm fine as is.

>
> Oh, now I see why you said SGP_NOALLOC does clear the Uptodate flag:
> "goto clear", haha: when we clear the page we set the Uptodate flag.
>
> And I may have another patch to slot in: I was half expecting you to
> question why SGP_READ behaves as it does, so in preparing its defence
> I checked, and found it was not doing quite what I remembered: changes
> were made a long time ago, which have left it slightly suboptimal.
> But that really has nothing to do with the rest of this series,
> and I don't need to run it past you before reposting.
>
> I hope that some of the features in this series can be useful to you.

Thanks, I will see.

>
> Thanks,
> Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-06  5:43             ` Hugh Dickins
@ 2021-08-06 17:57               ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-08-06 17:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Thu, Aug 5, 2021 at 10:43 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 5 Aug 2021, Yang Shi wrote:
> >
> > By rereading the code, I think you are correct. Both cases do work
> > correctly without leaking. And the !CONFIG_NUMA case may carry the
> > huge page indefinitely.
> >
> > I think it is because khugepaged may collapse memory for another NUMA
> > node in the next loop, so it doesn't make too much sense to carry the
> > huge page, but it may be an optimization for !CONFIG_NUMA case.
>
> Yes, that is its intention.
>
> >
> > However, as I mentioned in an earlier email, the new pcp implementation
> > can cache THPs now, so we might not need to keep this convoluted logic
> > anymore. Just free the page if collapse fails, then re-allocate the
> > THP. The carried THP might improve the success rate a little bit, but I
> > doubt how noticeable it would be; it may not be worth the extra
> > complexity at all.
>
> It would be great if the new pcp implementation is good enough to
> get rid of khugepaged's confusing NUMA=y/NUMA=n differences; and all
> the *hpage stuff too, I hope.  That would be a welcome cleanup.

The other question is whether that optimization is worth it nowadays or
not. I bet not too many users build a NUMA=n kernel nowadays, even when
the kernel is actually running on a non-NUMA machine. Some small
devices may run a NUMA=n kernel, but I don't think they actually use
THP. So such code complexity could be removed from this point of view
too.

>
> > > > Collapse failure is not uncommon and leaking huge pages gets noticed.
>
> After writing that, I realized how I'm almost always testing a NUMA=y
> kernel (though on non-NUMA machines), and seldom try the NUMA=n build.
> So did so to check no leak, indeed; but was surprised, when comparing
> vmstats, that the NUMA=n run had done 5 times as much thp_collapse_alloc
> as the NUMA=y run.  I've merely made a note to look into that one day:
> maybe it was just a one-off oddity, or maybe the incrementing of stats
> is wrong down one path or the other.

Yeah, probably.

>
> Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index)
  2021-08-06 17:57               ` Yang Shi
@ 2021-08-12 18:19                 ` Yang Shi
  -1 siblings, 0 replies; 91+ messages in thread
From: Yang Shi @ 2021-08-12 18:19 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Shakeel Butt, Kirill A. Shutemov, Miaohe Lin,
	Mike Kravetz, Michal Hocko, Rik van Riel, Christoph Hellwig,
	Matthew Wilcox, Eric W. Biederman, Alexey Gladkov, Chris Wilson,
	Matthew Auld, Linux FS-devel Mailing List,
	Linux Kernel Mailing List, linux-api, Linux MM

On Fri, Aug 6, 2021 at 10:57 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Aug 5, 2021 at 10:43 PM Hugh Dickins <hughd@google.com> wrote:
> >
> > On Thu, 5 Aug 2021, Yang Shi wrote:
> > >
> > > By rereading the code, I think you are correct. Both cases do work
> > > correctly without leaking. And the !CONFIG_NUMA case may carry the
> > > huge page indefinitely.
> > >
> > > I think it is because khugepaged may collapse memory for another NUMA
> > > node in the next loop, so it doesn't make too much sense to carry the
> > > huge page, but it may be an optimization for !CONFIG_NUMA case.
> >
> > Yes, that is its intention.
> >
> > >
> > > However, as I mentioned in an earlier email, the new pcp implementation
> > > can cache THPs now, so we might not need to keep this convoluted logic
> > > anymore. Just free the page if collapse fails, then re-allocate the
> > > THP. The carried THP might improve the success rate a little bit, but I
> > > doubt how noticeable it would be; it may not be worth the extra
> > > complexity at all.
> >
> > It would be great if the new pcp implementation is good enough to
> > get rid of khugepaged's confusing NUMA=y/NUMA=n differences; and all
> > the *hpage stuff too, I hope.  That would be a welcome cleanup.
>
> The other question is whether that optimization is worth it nowadays or
> not. I bet not too many users build a NUMA=n kernel nowadays, even when
> the kernel is actually running on a non-NUMA machine. Some small
> devices may run a NUMA=n kernel, but I don't think they actually use
> THP. So such code complexity could be removed from this point of view
> too.
>
> >
> > > > > Collapse failure is not uncommon and leaking huge pages gets noticed.
> >
> > After writing that, I realized how I'm almost always testing a NUMA=y
> > kernel (though on non-NUMA machines), and seldom try the NUMA=n build.
> > So did so to check no leak, indeed; but was surprised, when comparing
> > vmstats, that the NUMA=n run had done 5 times as much thp_collapse_alloc
> > as the NUMA=y run.  I've merely made a note to look into that one day:
> > maybe it was just a one-off oddity, or maybe the incrementing of stats
> > is wrong down one path or the other.

I came up with a patch to remove the !CONFIG_NUMA case, and my test
found the same problem: the NUMA=n run had done 5 times as much
thp_collapse_alloc as the NUMA=y run with the vanilla kernel, exactly
as you saw.

A quick look shows the huge page allocation timing is different for
the two cases. For NUMA=n, the huge page is allocated by
khugepaged_prealloc_page() before scanning the address space, so a
huge page may be allocated even though there is no suitable range for
collapsing. Then the page would just be freed once khugepaged has
already made enough progress, and it tries to reallocate again on the
next pass. The problem should be more noticeable with a shorter scan
interval (scan_sleep_millisecs); I set it to 100ms for my test.

We could carry the huge page across scan passes for NUMA=n, but this
would make the code more complicated. I don't think it is really
worth it, so just removing the special case for NUMA=n sounds more
reasonable to me.
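
Roughly, the simplified shape could look like the sketch below
(collapse_with_own_hpage() and do_collapse() are made-up names for
illustration, not from the actual patch): allocate a THP per collapse
attempt and free it on failure, relying on the pcp high-order caching
from 44042b449872 to keep the retry cheap, instead of carrying *hpage
across scan passes.

static int collapse_with_own_hpage(struct mm_struct *mm,
                                   unsigned long address,
                                   int node, gfp_t gfp)
{
        struct page *hpage;
        int result;

        hpage = __alloc_pages_node(node, gfp | __GFP_THISNODE,
                                   HPAGE_PMD_ORDER);
        if (!hpage)
                return SCAN_ALLOC_HUGE_PAGE_FAIL;
        prep_transhuge_page(hpage);

        result = do_collapse(mm, address, hpage);       /* made-up helper */
        if (result != SCAN_SUCCEED)
                put_page(hpage);        /* likely straight back to the pcp THP list */
        return result;
}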

>
> Yeah, probably.
>
> >
> > Hugh

^ permalink raw reply	[flat|nested] 91+ messages in thread

end of thread, other threads:[~2021-08-12 18:19 UTC | newest]

Thread overview: 91+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-30  7:22 [PATCH 00/16] tmpfs: HUGEPAGE and MEM_LOCK fcntls and memfds Hugh Dickins
2021-07-30  7:22 ` Hugh Dickins
2021-07-30  7:25 ` [PATCH 01/16] huge tmpfs: fix fallocate(vanilla) advance over huge pages Hugh Dickins
2021-07-30  7:25   ` Hugh Dickins
2021-07-30 21:36   ` Yang Shi
2021-07-30 21:36     ` Yang Shi
2021-08-01  3:38     ` Hugh Dickins
2021-08-01  3:38       ` Hugh Dickins
2021-08-02 20:36       ` Yang Shi
2021-08-02 20:36         ` Yang Shi
2021-07-30  7:28 ` [PATCH 02/16] huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE Hugh Dickins
2021-07-30  7:28   ` Hugh Dickins
2021-07-30 23:48   ` Yang Shi
2021-07-30 23:48     ` Yang Shi
2021-07-30  7:30 ` [PATCH 03/16] huge tmpfs: remove shrinklist addition from shmem_setattr() Hugh Dickins
2021-07-30  7:30   ` Hugh Dickins
2021-07-30 21:50   ` Yang Shi
2021-07-30 21:50     ` Yang Shi
2021-07-30  7:36 ` [PATCH 04/16] huge tmpfs: revert shmem's use of transhuge_vma_enabled() Hugh Dickins
2021-07-30  7:36   ` Hugh Dickins
2021-07-30 21:56   ` Yang Shi
2021-07-30 21:56     ` Yang Shi
2021-08-01  4:01     ` Hugh Dickins
2021-08-01  4:01       ` Hugh Dickins
2021-08-02 20:39       ` Yang Shi
2021-08-02 20:39         ` Yang Shi
2021-07-30  7:39 ` [PATCH 05/16] huge tmpfs: move shmem_huge_enabled() upwards Hugh Dickins
2021-07-30  7:39   ` Hugh Dickins
2021-07-30 21:57   ` Yang Shi
2021-07-30 21:57     ` Yang Shi
2021-07-30  7:42 ` [PATCH 06/16] huge tmpfs: shmem_is_huge(vma, inode, index) Hugh Dickins
2021-07-30  7:42   ` Hugh Dickins
2021-07-30 23:34   ` Yang Shi
2021-07-30 23:34     ` Yang Shi
2021-08-01  5:22     ` Hugh Dickins
2021-08-01  5:22       ` Hugh Dickins
2021-08-01  5:37       ` Hugh Dickins
2021-08-01  5:37         ` Hugh Dickins
2021-08-02 21:14       ` Yang Shi
2021-08-02 21:14         ` Yang Shi
2021-08-04  8:28         ` Hugh Dickins
2021-08-04  8:28           ` Hugh Dickins
2021-08-04 19:01           ` Yang Shi
2021-08-04 19:01             ` Yang Shi
2021-08-06  5:21             ` Hugh Dickins
2021-08-06  5:21               ` Hugh Dickins
2021-08-06 17:41               ` Yang Shi
2021-08-06 17:41                 ` Yang Shi
2021-08-05 23:04         ` Yang Shi
2021-08-05 23:04           ` Yang Shi
2021-08-06  5:43           ` Hugh Dickins
2021-08-06  5:43             ` Hugh Dickins
2021-08-06 17:57             ` Yang Shi
2021-08-06 17:57               ` Yang Shi
2021-08-12 18:19               ` Yang Shi
2021-08-12 18:19                 ` Yang Shi
2021-07-30  7:45 ` [PATCH 07/16] memfd: memfd_create(name, MFD_HUGEPAGE) for shmem huge pages Hugh Dickins
2021-07-30  7:45   ` Hugh Dickins
2021-07-30 12:01   ` kernel test robot
2021-07-30 12:01     ` kernel test robot
2021-08-04 14:03   ` Kirill A. Shutemov
2021-08-06  3:33     ` Hugh Dickins
2021-08-06  3:33       ` Hugh Dickins
2021-07-30  7:48 ` [PATCH 08/16] huge tmpfs: fcntl(fd, F_HUGEPAGE) and fcntl(fd, F_NOHUGEPAGE) Hugh Dickins
2021-07-30  7:48   ` Hugh Dickins
2021-08-04 14:08   ` Kirill A. Shutemov
2021-08-06  4:34     ` Hugh Dickins
2021-08-06  4:34       ` Hugh Dickins
2021-07-30  7:51 ` [PATCH 09/16] huge tmpfs: decide stat.st_blksize by shmem_is_huge() Hugh Dickins
2021-07-30  7:51   ` Hugh Dickins
2021-07-30 23:40   ` Yang Shi
2021-07-30 23:40     ` Yang Shi
2021-07-30  7:55 ` [PATCH 10/16] tmpfs: fcntl(fd, F_MEM_LOCK) to memlock a tmpfs file Hugh Dickins
2021-07-30  7:55   ` Hugh Dickins
2021-08-03  1:38   ` Matthew Wilcox
2021-08-04  9:15     ` Hugh Dickins
2021-08-04  9:15       ` Hugh Dickins
2021-07-30  7:57 ` [PATCH 11/16] tmpfs: fcntl(fd, F_MEM_LOCKED) to test if memlocked Hugh Dickins
2021-07-30  7:57   ` Hugh Dickins
2021-07-30  8:00 ` [PATCH 12/16] tmpfs: refuse memlock when fallocated beyond i_size Hugh Dickins
2021-07-30  8:00   ` Hugh Dickins
2021-07-30  8:03 ` [PATCH 13/16] mm: bool user_shm_lock(loff_t size, struct ucounts *) Hugh Dickins
2021-07-30  8:03   ` Hugh Dickins
2021-07-30  8:06 ` [PATCH 14/16] mm: user_shm_lock(,,getuc) and user_shm_unlock(,,putuc) Hugh Dickins
2021-07-30  8:06   ` Hugh Dickins
2021-07-30  8:09 ` [PATCH 15/16] tmpfs: permit changing size of memlocked file Hugh Dickins
2021-07-30  8:09   ` Hugh Dickins
2021-07-30  8:13 ` [PATCH 16/16] memfd: memfd_create(name, MFD_MEM_LOCK) for memlocked shmem Hugh Dickins
2021-07-30  8:13   ` Hugh Dickins
2021-07-30 11:24   ` kernel test robot
2021-07-30 11:24     ` kernel test robot
